Database Indexing (Part-I)

Let's explore the relevance of database indexing in a Relational Database Management System. This article aims to answer the What, How, and Why of database indexing.

What is Database Indexing?

In simple terms, database indexing is a concept that enables fast data retrieval by providing quick access to specific data within the database. This can be achieved by creating an index that either references the original data or stores a pointer to it. Indexing improves the efficiency of data retrieval, searching, and sorting operations in the database.

To better understand its purpose, let's address the problem it solves:

1.) In a database, a table is a collection of rows, also known as "records."

2.) Consider a scenario where we have a user table with fields such as id, name,
age, address, etc. Currently, no "index" is present on any column/field.

idnameageaddress
1John15Los Angles
2Mike15California
3Mayank23Delhi
4Joel18Los Santos
5Lily23Washington DC
6Marcus21California
7Diana18New York
8Chris23Washington DC

2.1) For example, let's say we want to retrieve data for all users whose age is 23.
The query would be:
SELECT id, name, age, address FROM users where age = 23

2.2.) Executing this query without indexing requires evaluating every record and
checking whether each record's age matches 23. If we assume we have 100
records and each record takes 1 second to process (for better understanding),
the complete scanning would take 100 seconds. Please keep this value in mind
for later comparison with the indexed approach.

3.) Since indexes are data structures that arrange column data (the column on which the index is created) to facilitate fast data retrieval. The index stores references to the actual data in the table. In the case of database indexing, a commonly used data structure is the "B-Tree" (we will explore its workings in another part). B-Trees enable fast insertion, deletion, and search operations.

4.) Now, let's reconsider the previous scenario where we need to retrieve data for users aged 23. This time we will have indexing on our "age" column/field. So below depicts how the data would look like in a B-Tree (an assumption to make you understand)

agerecord
15id: 1, 2
18id: 4, 7
21id: 6
23id: 3, 5, 8

4.1) The table above depicts the age column arranged in a specific order
maintained by the B-Tree. Each age value points to the records where that age is
present.

4.2) When searching for records, the time complexity for searching in a B-Tree is
O(log n). This complexity allows us to disregard cases where finding the record is
not possible. It's similar to the Binary Search algorithm, where we choose either
the left or right side depending on the condition.

4.3) Let's do some simple math. Assuming each record traversal takes 1 second
and we have 100 records without indexing, the total time taken would be O(n),
i.e. 100 seconds. However, with indexing, we eliminate half of the traversals in
each step, resulting in a time complexity of O(log n). Assuming base 10, for 100
records, we would require log(100) seconds, which equals 2 seconds. By using
proper indexing on the right columns and employing the correct query, we
optimize the query time from 100 seconds to 2 seconds. "This is the power of
indexing!".

Note: We will delve deeper into the internal workings of B-Trees and their utilization in indexing in Database Indexing (Part II).

Let's address some important questions based on the observations made so far:

Q-1) Is data stored in B-Tree (data structure which stores index column value) in sorted order?
Data stored in a B-Tree does not necessarily need to be in sorted order. When inserting data for the index_column, we aim to maintain the Balanced B-Tree property. This property guarantees that when traversing the database index_column, the data is obtained in sorted order. This characteristic greatly enhances the efficiency of searching, insertion, deletion, and other operations. The time complexity for these operations in a B-Tree is O(log n).

Q-2) The data structure that we are using will require some more time complexity and memory management to arrange index-column. So would this affect our performance also?
Yes, organizing index column data in a B-Tree does require some additional time and memory compared to storing the data in an unstructured manner. However, the benefits of using a B-Tree index, such as improved search and retrieval performance, often outweigh the overhead associated with maintaining the index structure.

Q-3) Since the example provided above, without indexing it was taking 100s (as per assumption) while with indexing it was taking 2s (as per assumption). It depicts that indexing is a very powerful concept.
-> Indexing is indeed a powerful concept, but it involves trade-offs. In the mentioned example, we observed that additional space is required for indexing, but the performance gains offered by the B-Tree structure outweigh the space requirement

-> However, there may be cases where indexing does not provide significant performance improvements.
Case 1: When the index_column and the query used are not well-suited.
For instance, if we want to retrieve data for users where salary = 30,000 and age = 20, having an index on the age column alone would not make much difference because we still need to search through all the records matching the salary condition. This problem can be resolved by using composite indexing, combining age and salary to form a unique index. Sorting the data based on age and then salary allows for fast searching and retrieval.

Case 2: Inefficient use of the index.
Let's suppose we have a "users" table with columns like id, first_name, and last_name, and we have indexed the first_name column. If we have a query like this:
SELECT id, first_name, last_name FROM users where first_name LIKE "%Mike%";

The preceding wildcard in the query ("%Mike%") makes it difficult for the database engine to effectively utilize the index, as the first_name column may have preceding characters. This leads to a full-scan lookup for data retrieval. To improve the query, we can remove the preceding wildcard since the first_name column itself indicates the absence of preceding characters. Thus, the improved query would be:
SELECT id, first_name, last_name FROM users where first_name LIKE "Mike%";

There can be various scenarios where the combination of indexing and queries may not be optimal, underscoring the importance of selecting the right columns for indexing and using appropriate queries.

Conclusion

In conclusion, we have explored the concept of database indexing in Part-I of our series. We have discussed its significance in achieving fast data retrieval and examined the benefits and considerations associated with indexing. In Part II, we will delve into the internal workings of B-Trees and cover different types of indexing and their usage. We hope you found this article informative, and we welcome any follow-up questions to foster further discussion on this topic.