Wednesday, 14 February 2024

Multi-level indexing in Databases

Multi-level indexing, often referred to as B-tree indexing, is a hierarchical data structure used in databases for efficiently finding particular records. This approach is invaluable for extensive datasets where the usual single-level indexing is too slow. Consider the challenge of looking up a certain name in a phone book containing millions of names. Going through each name one by one would take forever. Multi-level indexing serves as a navigational tool, leading you promptly to the section you need.

 

Let's delve deeper into indexing by exploring how data is stored on a disk.

In the computer world, understanding disks, tracks, and sectors is key to grasping how data is physically stored and retrieved.

 

Disk

Disks are like flat, circular plates coated with a magnetic or optical material. These plates spin fast, and data is stored in small magnetized or etched areas on their surface. Traditional hard disk drives (HDDs) use magnetic plates, while solid-state drives (SSDs) use flash memory chips.

 

Tracks

Tracks are akin to the concentric grooves on a record album. They're thin, circular bands on the disk surface where data is stored. More tracks mean more storage capacity.

 

Sector

A sector is the smallest storage unit on a hard disk or similar device. It's usually a fixed-size portion of the disk surface that can be read or written in one I/O operation. Sectors are typically 512 bytes or 4 KB in modern disk drives. They're fundamental for addressing data on a disk.

 

 


Block

Block refers to a group of adjacent sectors on a disk. Operating systems often combine sectors into these larger blocks for efficiency reasons. Reading and writing large data chunks at once is faster than accessing individual sectors. Block sizes can vary depending on the file system and disk format, but they're generally much larger than sectors (e.g., 4KB, 8KB, or more).

 

So, if you hear "block" in a disk context, consider the context to know if it means the same as a sector or a larger group of sectors. It's best to use "sector" for the smallest unit and clarify if you're talking about a group of sectors as a "block."

 

Above diagram depicts how a track, sector and blocks looks like. There are 5 track, where each track has 8 sectors.

 

 

Let's understand the significance of indexing using an example of an employee table.

 

To illustrate, let's assume a block size of 512 bytes.

 

id

name

email

city

country

age

1

krishna

krishna@demo.com

Bangalore

India

34

2

Ram

ram@demo.com

Bangalore

Hyderabad

31

 

 

 

 

 

 

 

 

 

 

 

 

 

Let's compute the size of each row in bytes:

 

1.   id: 4 bytes

2.   name: 40 bytes

3.   email: 40 bytes

4.   city: 20 bytes

5.   country: 20 bytes

6.   age: 4 bytes

 

Total row size = 128 bytes.

 

Given a block size of 512 bytes, we can accommodate 512/128 = 4 rows in a block.

 

Now, let's determine the time required to read an employee with a specific ID.

 

Considering we have 400 employee records, and each block can hold 4 employee records, all 400 employee records are distributed across 400/4 blocks = 100 blocks.

 

In the worst-case scenario, locating an employee with a given ID may necessitate scanning through all 100 blocks.

 

With indexing

Let's develop an index data structure to quickly find an employee using their ID. We'll create an index that includes two columns:

a.   id and

b.   a pointer to the record's location on the disk.

 

id

pointer_to_a_record_on_the_disk

 

 

 

 

 

 

 

 

 

 

 

For every employee record, we'll create an entry in the index. Since the index is also stored on the disk, let's determine how many blocks are needed for the index.

 

First, let's calculate the size of each index row in bytes:

 

1.   id: 4 bytes

2.   pointer: 8 bytes

So, each entry size is (4 + 8) = 12 bytes.

 

A block with a size of 512 bytes can accommodate (512/12) = 42.6 rows, which rounds down to 42 rows (using the floor value).

 

Thus, each block can contain 42 index rows. To store 400 index rows, we need (400 /42) = 9.52, which rounds up to 10 blocks (using the ceiling value).

 

In the worst-case scenario, to find an employee with a given ID, we may need to scan 10 indexing blocks (where the indexes are stored). Once we obtain the address of the record, we can read the actual data block containing the exact data. Therefore, by scanning a total of 11 blocks (10 indexing blocks + 1 data block), we can retrieve the desired record efficiently. Without indexing, we might need to scan all 100 blocks.

 

 


As depicted in the image above, the first index block accommodates addresses for employees with IDs ranging from 1 to 42. Similarly, the second index block can store addresses for employees with IDs from 43 to 84, and so forth.

 

However, as the number of records in the employee table grows, the number of index blocks will also rise, as illustrated in the table below.

 

a.   Each data block can contain 4 employee records.

b.   Each index block can hold 42 index rows.

 

Total Records

Data blocks

Index blocks

400

400 /4 = 100

400/42 = 9.52

4000

4000/4 = 1000

4000/42 = 95.23

40000

10000

952.38

400000

100000

9523.8

 

As the volume of data increases, the number of first-level indexing blocks will increase, resulting in longer search times. This is where second-level indexing becomes relevant.

 

The second-level index is the index above the first-level index.




In our scenario, each index block can accommodate up to 42 entries. The second-level index maintains a single pointer to these 42 entries within the first-level index.

 

The starting key for the second-level index is 1, indicating its pointer to the first block in the first-level index. Subsequently, the next key of the second-level index is 43, pointing to the second block of the second-level index, and so forth. This setup is known as a sparse index, where one entry is maintained for each index block.

 

Given that a second-level index block can hold up to 42 entries, one block in the second-level index can effectively manage 42 * 42 * 42 = 74088 employee records.

 

With 10 index blocks in the second-level index, they can collectively handle 10 * 74088 = 740880

 

 

How does the search happen?

Let's say you're looking for an employee with record 129. First, you'll search the second-level index blocks. Once you determine which first-level index block to query, you'll access that specific first-level index block and search within it to retrieve the address of the employee record corresponding to the ID 129. This process is illustrated in the image below.

 


In this example, the total number of blocks retrieved is:

 

a.   1 block from the second-level index.

b.   1 block from the first-level index.

c.    1 block from the actual table data.

 

So, by accessing 3 blocks in total, we can obtain the employee with the given ID. Incorporating multi-level indexes helps reduce the number of blocks accessed.

 

As the number of index blocks increases in the second level index, we'll introduce a third level of indexing on top of the second level, and continue this pattern as needed. This concept is the foundation of B and B+ trees. If you visualize the multi-level index image from a different angle, you'll notice it resembles a tree-like structure.




Finally B-tree is a self-managed multi-level indexing data structure. Will talk about this more in subsequent posts.


                                                  System Design Questions

No comments:

Post a Comment