Apache Pinot is a real-time distributed OLAP datastore designed for low-latency analytics. Efficient indexing is a critical component that allows Pinot to provide fast query performance. Among the various indexing techniques available, forward indexing plays an important role.
This blog post helps you to understand what indexing is, what forward indexing means, and the different types of forward indexing supported in Pinot.
1. Introduction to Indexing
Indexing is a technique used in databases to speed up data retrieval operations. Instead of scanning the entire dataset for every query, indexes allow databases to quickly locate relevant rows, improving query performance significantly.
1.1 Why is Indexing Important?
· Reduces query latency by avoiding full-table scans
· Enhances performance for filtering, aggregations, and sorting operations
· Helps optimize storage and memory usage
Example
Imagine you have a massive dataset containing 100 million customer transactions in an e-commerce platform. Each transaction record contains the following fields:
Transaction ID |
Customer ID |
Product Category |
Amount |
Timestamp |
TXN001 |
CUST123 |
Electronics |
$299 |
2024-03-01 10:00:00 |
TXN002 |
CUST456 |
Apparel |
$79 |
2024-03-01 10:05:12 |
... |
... |
... |
... |
... |
TXN100M |
CUST789 |
Home Decor |
$150 |
2025-03-30 15:45:30 |
1.2 Problem Without Indexing
Now, let’s say you want to run a query like:
SELECT * FROM transactions WHERE Customer_ID = 'CUST456';
Without indexing, the database must scan all 100 million records to find the matching transactions. This is called a full-table scan, which is extremely slow and inefficient, potentially taking minutes to return results.
1.3 How Indexing Helps here?
Indexing works like a table of contents or a dictionary lookup that helps the database to find the relevant records quickly. Suppose we create an index on Customer_ID, it will store a sorted mapping of Customer_ID to the corresponding row locations, allowing queries to be executed in milliseconds instead of minutes.
CREATE INDEX idx_customer_id ON transactions(Customer_ID);
Now, when we run the same query:
SELECT * FROM transactions WHERE Customer_ID = 'CUST456';
Instead of scanning all 100 million records, the database directly looks up CUST456 in the index and retrieves the relevant rows instantly.
2. Forward Indexing
2.1 What is Forward Indexing in Pinot?
In Apache Pinot, a Forward Index is the default indexing mechanism that stores column values row-wise. This allows Pinot to efficiently retrieve entire rows during query execution.
Apache Pinot supports three types of forward indexes:
· Dictionary-Encoded Forward Index: Assigns a unique ID to each distinct column value, creating a dictionary that maps IDs to actual values. The forward index stores bit-compressed IDs instead of raw values, reducing storage and improving query performance.
· Sorted Forward Index: Uses dictionary encoding but also maintains a sorted order of values. Each unique value maps to a range of document IDs (start and end positions), optimizing lookups and scans.
· Raw Value Forward Index: Stores actual column values directly without encoding or compression. This approach is useful when dictionary encoding isn’t efficient, such as for high-cardinality columns.
2.2 Dictionary-Encoded Forward Index in Apache Pinot
A Dictionary-Encoded Forward Index is a storage optimization technique where each unique value in a column is assigned a unique ID, and instead of storing the actual values, only the encoded IDs are stored.
This technique reduces storage costs and improves query performance by using integer-based lookups instead of scanning large text or numeric values directly.
How Dictionary Encoding Works (Step-by-Step Explanation)?
Let’s take an example where we have a Product Table in Pinot:
Row ID |
Product Name |
Category |
1 |
Laptop |
Electronics |
2 |
Smartphone |
Electronics |
3 |
Keyboard |
Accessories |
4 |
Laptop |
Electronics |
5 |
Smartphone |
Electronics |
Step 1: Create a Dictionary for Each Column
Pinot assigns a unique dictionary ID for each unique value in a column:
Dictionary for Product Name
Dictionary ID |
Product Name |
0 |
Laptop |
1 |
Smartphone |
2 |
Keyboard |
Dictionary for Category
Dictionary ID |
Category |
0 |
Electronics |
1 |
Accessories |
Step 2: Replace Column Values with Encoded IDs
Instead of storing actual values, Pinot stores only the dictionary IDs:
Row Id |
Product Name (Dictionary ID) |
Category (Dictionary ID) |
1 |
0 |
0 |
2 |
1 |
0 |
3 |
2 |
1 |
4 |
0 |
0 |
5 |
1 |
0 |
Now, instead of scanning long text values like "Laptop", Pinot only scans small integers (0, 1, 2), which are much faster to process and take less storage.
Advantages of Dictionary-Encoded Forward Index
· Storage Efficiency: Text and high-cardinality numeric values are stored as small integer IDs, reducing disk space usage.
· Faster Query Execution: Since numbers are smaller than text, scans, filtering, and lookups are much faster.
· Optimized Filtering (WHERE conditions): Pinot can quickly scan integer values instead of large text, making filtering operations (WHERE Product = 'Laptop') more efficient.
· Improved Compression: Since dictionary IDs have fewer unique values, they are highly compressible, further reducing storage costs.
Disadvantages (When Not to Use It?)
· Not Ideal for High-Cardinality Data: If a column has too many unique values (e.g., timestamps, UUIDs, user IDs), the dictionary grows large and loses its storage advantage.
· Extra Lookup Step: When querying, Pinot must first look up the dictionary to get actual values, adding a slight overhead.
Example Query Execution in Pinot
Let’s say we want to query:
SELECT COUNT(*) FROM ProductTable WHERE Product_Name = 'Laptop';
How Pinot Processes This Query?
· Instead of scanning "Laptop" directly, Pinot looks up the dictionary ID (0).
· Pinot scans the compressed forward index for occurrences of 0 in the Product Name column.
· It returns the count faster than scanning the raw text values.
2.3 Sorted Forward Index
A Sorted Forward Index is an optimization technique in Apache Pinot where a column is stored in a pre-sorted order, allowing for faster lookups, range queries, and filtering operations.
Unlike a standard dictionary-encoded forward index, which assigns unique IDs but does not impose any order, a sorted forward index keeps dictionary-encoded values in sorted order, enabling binary search and efficient range scans.
How Sorted Forward Index Works (Step-by-Step Explanation)?
Let’s take an example of a Transaction Table in Pinot:
Transaction ID |
Customer ID |
Amount |
T1 |
C101 |
1000 |
T2 |
C102 |
700 |
T3 |
C101 |
50 |
T4 |
C103 |
1200 |
T5 |
C102 |
500 |
If we apply Sorted Forward Index on the Customer ID column, Pinot will store Customer IDs in sorted order instead of how they were inserted.
Step 1: Sort the Customer ID Column
Pinot sorts the values in the Customer ID column and remembers their positions.
Original Order |
Sorted Order (Sorted in Index) |
C101 |
C101 (Position 0) |
C102 |
C101 (Position 1) |
C101 |
C102 (Position 2) |
C103 |
C102 (Position 3) |
C102 |
C103 (Position 4) |
Stored like below.
Customer Id |
Start Position |
End Position |
C101 |
0 |
1 |
C102 |
2 |
3 |
C103 |
4 |
4 |
Now, when Pinot searches for C101, it can quickly find the start and end positions instead of scanning everything.
Step 2: Faster Queries Using Sorted Index
Let's say we run the query:
SELECT COUNT(*) FROM Transactions WHERE CustomerID = 'C101';
How Pinot Uses Sorted Forward Index?
It knows C101 starts at position 0 and ends at position 1 (because it's sorted). Instead of scanning all rows, it directly jumps to those positions. This makes lookups much faster!
Advantages Of Sorted Forward Index(When to Use It?)
· Fast filtering (WHERE CustomerID = 'C101')
· Great for range queries (WHERE Amount BETWEEN 500 AND 1000)
· Less storage needed compared to raw indexes
Disadvantages (When Not to Use It?)
· Not good for high-cardinality columns (e.g., timestamps)
· Expensive updates since sorting needs to be maintained
2.4 Raw Value Forward Index
A Raw Value Forward Index is the simplest type of forward index in Apache Pinot. It stores the actual column values as they were ingested without any transformation, sorting, or encoding.
Unlike dictionary-encoded or sorted forward indexes, a raw forward index keeps every value in its original row order, making it ideal for high-cardinality columns.
Let's say we have this Transactions Table:
Transaction ID |
Customer ID |
Amount |
T1 |
C101 |
1000 |
T2 |
C103 |
700 |
T3 |
C101 |
1200 |
T4 |
C102 |
500 |
With a Raw Value Forward Index, Pinot stores the column’s values exactly as they were inserted.
How Data Is Stored (Raw Format)
Customer ID Column Storage (Row-Wise)
Position 0: C101
Position 1: C103
Position 2: C101
Position 3: C102
Amount Column Storage (Row-Wise)
Position 0: 1000
Position 1: 700
Position 2: 1200
Position 3: 500
Each column’s values are stored as they are, keeping their original row order. There’s no sorting or dictionary compression applied.
How Queries Work with Raw Forward Index?
Query 1: Find all transactions for CustomerID = 'C101'
SELECT * FROM Transactions WHERE CustomerID = 'C101';
How Pinot Retrieves Data?
· It scans the entire "Customer ID" column.
· Finds C101 at positions 0 and 2.
· Fetches the corresponding rows.
In this case, Full column scan is required, which is slower compared to Sorted Forward Index.
Query 2: Find the total sales amount
SELECT SUM(Amount) FROM Transactions;
How Pinot Retrieves Data?
· Reads the "Amount" column directly.
· Adds up 1000 + 700 + 1200 + 500.
· Returns the result.
Since values are stored as-is, Pinot can read them without extra lookup steps (unlike dictionary encoding). This makes aggregations faster.
Use when:
· The column has high-cardinality values (e.g., timestamps, user IDs).
· You perform many aggregations (SUM, AVG, MAX, MIN).
Avoid when:
· You frequently filter or search on the column (WHERE CustomerID = 'C101'), because full scans are needed.
· The column has low-cardinality values, where dictionary-encoded or sorted forward indexes are more efficient.
In summary,
· If you need efficient filtering on a low-cardinality column, go with Dictionary-Encoded or Sorted Forward Index.
· If you frequently aggregate data or work with high-cardinality columns, choose Raw Value Forward Index.
· Sorted Forward Index is used for Fast filtering, range queries. Avoid this for High-cardinality columns.
Previous Next Home
No comments:
Post a Comment