Tuesday, 13 February 2024

Role of indexing in Databases

Indexing in databases is like using a book's index to find information quickly. Instead of searching every page, you just look up a word in the index and go straight to the page you need.

 

Similarly, in a database, an index is a special data strcuture that helps to find information faster. It keeps track of the values in certain columns and where they are in the database. So, when you ask the database for specific information, it can use the index to find it quickly without searching through everything.

 

Let's delve deeper into indexing by exploring how data is stored on a disk.

In the computer world, understanding disks, tracks, and sectors is key to grasping how data is physically stored and retrieved.

 

Disk

Disks are like flat, circular plates coated with a magnetic or optical material. These plates spin fast, and data is stored in small magnetized or etched areas on their surface. Traditional hard disk drives (HDDs) use magnetic plates, while solid-state drives (SSDs) use flash memory chips.

 

Tracks

Tracks are akin to the concentric grooves on a record album. They're thin, circular bands on the disk surface where data is stored. More tracks mean more storage capacity.

 

Sector

A sector is the smallest storage unit on a hard disk or similar device. It's usually a fixed-size portion of the disk surface that can be read or written in one I/O operation. Sectors are typically 512 bytes or 4 KB in modern disk drives. They're fundamental for addressing data on a disk.

 


 

Block

Block refers to a group of adjacent sectors on a disk. Operating systems often combine sectors into these larger blocks for efficiency reasons. Reading and writing large data chunks at once is faster than accessing individual sectors. Block sizes can vary depending on the file system and disk format, but they're generally much larger than sectors (e.g., 4KB, 8KB, or more).

 

So, if you hear "block" in a disk context, consider the context to know if it means the same as a sector or a larger group of sectors. It's best to use "sector" for the smallest unit and clarify if you're talking about a group of sectors as a "block."

 

Above diagram depicts how a track, sector and blocks looks like. There are 5 track, where each track has 8 sectors.

 

 

Let's understand the significance of indexing using an example of an employee table.

 

To illustrate, let's assume a block size of 512 bytes.

 

id

name

email

city

country

age

1

krishna

krishna@demo.com

Bangalore

India

34

2

Ram

ram@demo.com

Bangalore

Hyderabad

31

 

 

 

 

 

 

 

 

 

 

 

 

 

Let's compute the size of each row in bytes:

 

1.   id: 4 bytes

2.   name: 40 bytes

3.   email: 40 bytes

4.   city: 20 bytes

5.   country: 20 bytes

6.   age: 4 bytes

 

Total row size = 128 bytes.

 

Given a block size of 512 bytes, we can accommodate 512/128 = 4 rows in a block.

 

Now, let's determine the time required to read an employee with a specific ID.

 

Considering we have 400 employee records, and each block can hold 4 employee records, all 400 employee records are distributed across 400/4 blocks = 100 blocks.

 

In the worst-case scenario, locating an employee with a given ID may necessitate scanning through all 100 blocks.

 

With indexing

Let's develop an index data structure to quickly find an employee using their ID. We'll create an index that includes two columns:

a.   id and

b.   a pointer to the record's location on the disk.

 

id

pointer_to_a_record_on_the_disk

 

 

 

 

 

 

 

 

 

 

 

For every employee record, we'll create an entry in the index. Since the index is also stored on the disk, let's determine how many blocks are needed for the index.

 

First, let's calculate the size of each index row in bytes:

 

1.   id: 4 bytes

2.   pointer: 8 bytes

So, each entry size is (4 + 8) = 12 bytes.

 

A block with a size of 512 bytes can accommodate (512/12) = 42.6 rows, which rounds down to 42 rows (using the floor value).

 

Thus, each block can contain 42 index rows. To store 400 index rows, we need (400 /42) = 9.52, which rounds up to 10 blocks (using the ceiling value).

 

In the worst-case scenario, to find an employee with a given ID, we may need to scan 10 indexing blocks (where the indexes are stored). Once we obtain the address of the record, we can read the actual data block containing the exact data. Therefore, by scanning a total of 11 blocks (10 indexing blocks + 1 data block), we can retrieve the desired record efficiently. Without indexing, we might need to scan all 100 blocks.

 

We'll delve into more detail about multi-level indexing and B-Trees in subsequent posts.


                                                  System Design Questions

No comments:

Post a Comment