Tuesday, 29 July 2025

Understanding Forward Indexing in Apache Pinot

Apache Pinot is a real-time distributed OLAP datastore designed for low-latency analytics. Efficient indexing is a critical component that allows Pinot to provide fast query performance. Among the various indexing techniques available, forward indexing plays an important role.

This blog post helps you to understand what indexing is, what forward indexing means, and the different types of forward indexing supported in Pinot.

 

1. Introduction to Indexing

Indexing is a technique used in databases to speed up data retrieval operations. Instead of scanning the entire dataset for every query, indexes allow databases to quickly locate relevant rows, improving query performance significantly.

 

1.1 Why is Indexing Important?

·      Reduces query latency by avoiding full-table scans

·      Enhances performance for filtering, aggregations, and sorting operations

·      Helps optimize storage and memory usage

 

Example

Imagine you have a massive dataset containing 100 million customer transactions in an e-commerce platform. Each transaction record contains the following fields:

 

Transaction ID 

Customer ID

Product Category 

Amount

Timestamp

TXN001 

CUST123

Electronics

$299 

2024-03-01 10:00:00

TXN002

CUST456

Apparel

$79

2024-03-01 10:05:12

...

...

...

...

...

TXN100M

CUST789

Home Decor 

$150 

2025-03-30 15:45:30

 

1.2 Problem Without Indexing

Now, let’s say you want to run a query like:

SELECT * FROM transactions WHERE Customer_ID = 'CUST456';

Without indexing, the database must scan all 100 million records to find the matching transactions. This is called a full-table scan, which is extremely slow and inefficient, potentially taking minutes to return results.

 

1.3 How Indexing Helps here?

Indexing works like a table of contents or a dictionary lookup that helps the database to find the relevant records quickly. Suppose we create an index on Customer_ID, it will store a sorted mapping of Customer_ID to the corresponding row locations, allowing queries to be executed in milliseconds instead of minutes.

CREATE INDEX idx_customer_id ON transactions(Customer_ID);

Now, when we run the same query:

SELECT * FROM transactions WHERE Customer_ID = 'CUST456';

Instead of scanning all 100 million records, the database directly looks up CUST456 in the index and retrieves the relevant rows instantly.

 

2. Forward Indexing

 

2.1 What is Forward Indexing in Pinot?

In Apache Pinot, a Forward Index is the default indexing mechanism that stores column values row-wise. This allows Pinot to efficiently retrieve entire rows during query execution.

 

Apache Pinot supports three types of forward indexes:

 

·      Dictionary-Encoded Forward Index: Assigns a unique ID to each distinct column value, creating a dictionary that maps IDs to actual values. The forward index stores bit-compressed IDs instead of raw values, reducing storage and improving query performance.

 

·      Sorted Forward Index: Uses dictionary encoding but also maintains a sorted order of values. Each unique value maps to a range of document IDs (start and end positions), optimizing lookups and scans.

 

·      Raw Value Forward Index: Stores actual column values directly without encoding or compression. This approach is useful when dictionary encoding isn’t efficient, such as for high-cardinality columns.

 

2.2 Dictionary-Encoded Forward Index in Apache Pinot

A Dictionary-Encoded Forward Index is a storage optimization technique where each unique value in a column is assigned a unique ID, and instead of storing the actual values, only the encoded IDs are stored.

 

This technique reduces storage costs and improves query performance by using integer-based lookups instead of scanning large text or numeric values directly.

 

How Dictionary Encoding Works (Step-by-Step Explanation)?

Let’s take an example where we have a Product Table in Pinot:

 

Row ID 

Product Name 

Category

1

Laptop

Electronics

2

Smartphone 

Electronics

3

Keyboard 

Accessories

4

Laptop 

Electronics

5

Smartphone 

Electronics

 

Step 1: Create a Dictionary for Each Column

Pinot assigns a unique dictionary ID for each unique value in a column:

 

Dictionary for Product Name

Dictionary ID

Product Name

0

Laptop

1

Smartphone

2

Keyboard

 

Dictionary for Category

Dictionary ID

Category

0

Electronics

1

Accessories

 

Step 2: Replace Column Values with Encoded IDs

Instead of storing actual values, Pinot stores only the dictionary IDs:

 

Row Id

Product Name (Dictionary ID)   

Category (Dictionary ID)

1

0

0

2

1

0

3

2

1

4

0

0

5

1

0

 

Now, instead of scanning long text values like "Laptop", Pinot only scans small integers (0, 1, 2), which are much faster to process and take less storage.

 

Advantages of Dictionary-Encoded Forward Index

·      Storage Efficiency: Text and high-cardinality numeric values are stored as small integer IDs, reducing disk space usage.

·      Faster Query Execution: Since numbers are smaller than text, scans, filtering, and lookups are much faster.

·      Optimized Filtering (WHERE conditions): Pinot can quickly scan integer values instead of large text, making filtering operations (WHERE Product = 'Laptop') more efficient.

·      Improved Compression: Since dictionary IDs have fewer unique values, they are highly compressible, further reducing storage costs.

 

Disadvantages (When Not to Use It?)

·      Not Ideal for High-Cardinality Data: If a column has too many unique values (e.g., timestamps, UUIDs, user IDs), the dictionary grows large and loses its storage advantage.

·      Extra Lookup Step: When querying, Pinot must first look up the dictionary to get actual values, adding a slight overhead.

 

Example Query Execution in Pinot

Let’s say we want to query:

SELECT COUNT(*) 
FROM ProductTable 
WHERE Product_Name = 'Laptop';

How Pinot Processes This Query?

·      Instead of scanning "Laptop" directly, Pinot looks up the dictionary ID (0).

·      Pinot scans the compressed forward index for occurrences of 0 in the Product Name column.

·      It returns the count faster than scanning the raw text values.

 

2.3 Sorted Forward Index

A Sorted Forward Index is an optimization technique in Apache Pinot where a column is stored in a pre-sorted order, allowing for faster lookups, range queries, and filtering operations.

 

Unlike a standard dictionary-encoded forward index, which assigns unique IDs but does not impose any order, a sorted forward index keeps dictionary-encoded values in sorted order, enabling binary search and efficient range scans.

 

How Sorted Forward Index Works (Step-by-Step Explanation)?

Let’s take an example of a Transaction Table in Pinot:

 

Transaction ID 

Customer ID

Amount

T1

C101

1000

T2

C102

700

T3

C101

50

T4

C103

1200

T5

C102

500

If we apply Sorted Forward Index on the Customer ID column, Pinot will store Customer IDs in sorted order instead of how they were inserted.

 

Step 1: Sort the Customer ID Column

Pinot sorts the values in the Customer ID column and remembers their positions.

 

Original Order

Sorted Order (Sorted in Index)

C101

C101 (Position 0)

C102

C101 (Position 1)

C101

C102 (Position 2)

C103

C102 (Position 3)

C102

C103 (Position 4)

 

Stored like below.

Customer Id

Start Position

End Position

C101

0

1

C102

2

3

C103

4

4

 

Now, when Pinot searches for C101, it can quickly find the start and end positions instead of scanning everything.

 

Step 2: Faster Queries Using Sorted Index

 

Let's say we run the query:

SELECT COUNT(*) FROM Transactions WHERE CustomerID = 'C101';

 

How Pinot Uses Sorted Forward Index?

It knows C101 starts at position 0 and ends at position 1 (because it's sorted). Instead of scanning all rows, it directly jumps to those positions. This makes lookups much faster!

 

Advantages Of Sorted Forward Index(When to Use It?)

·      Fast filtering (WHERE CustomerID = 'C101')

·      Great for range queries (WHERE Amount BETWEEN 500 AND 1000)

·      Less storage needed compared to raw indexes

 

Disadvantages (When Not to Use It?)

·      Not good for high-cardinality columns (e.g., timestamps)

·      Expensive updates since sorting needs to be maintained

 

 

2.4 Raw Value Forward Index

A Raw Value Forward Index is the simplest type of forward index in Apache Pinot. It stores the actual column values as they were ingested without any transformation, sorting, or encoding.

 

Unlike dictionary-encoded or sorted forward indexes, a raw forward index keeps every value in its original row order, making it ideal for high-cardinality columns.

 

Let's say we have this Transactions Table:

Transaction ID

Customer ID   

Amount

T1

C101

1000

T2

C103

700

T3

C101

1200

T4

C102

500

 

With a Raw Value Forward Index, Pinot stores the column’s values exactly as they were inserted.

 

How Data Is Stored (Raw Format)

Customer ID Column Storage (Row-Wise)

 

Position 0: C101

Position 1: C103

Position 2: C101

Position 3: C102

 

Amount Column Storage (Row-Wise)

 

Position 0: 1000

Position 1: 700

Position 2: 1200

Position 3: 500

 

Each column’s values are stored as they are, keeping their original row order. There’s no sorting or dictionary compression applied.

 

How Queries Work with Raw Forward Index?

 

Query 1: Find all transactions for CustomerID = 'C101'

SELECT * FROM Transactions WHERE CustomerID = 'C101';

 

How Pinot Retrieves Data?

·      It scans the entire "Customer ID" column.

·      Finds C101 at positions 0 and 2.

·      Fetches the corresponding rows.

 

In this case, Full column scan is required, which is slower compared to Sorted Forward Index.

 

Query 2: Find the total sales amount

SELECT SUM(Amount) FROM Transactions;

 

How Pinot Retrieves Data?

·      Reads the "Amount" column directly.

·      Adds up 1000 + 700 + 1200 + 500.

·      Returns the result.

 

Since values are stored as-is, Pinot can read them without extra lookup steps (unlike dictionary encoding). This makes aggregations faster.

 

Use when:

·      The column has high-cardinality values (e.g., timestamps, user IDs).

·      You perform many aggregations (SUM, AVG, MAX, MIN).

 

Avoid when:

·      You frequently filter or search on the column (WHERE CustomerID = 'C101'), because full scans are needed.

·      The column has low-cardinality values, where dictionary-encoded or sorted forward indexes are more efficient.

 

In summary,

·      If you need efficient filtering on a low-cardinality column, go with Dictionary-Encoded or Sorted Forward Index.

·      If you frequently aggregate data or work with high-cardinality columns, choose Raw Value Forward Index.

·      Sorted Forward Index is used for Fast filtering, range queries. Avoid this for High-cardinality columns.

 

 

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment