Programming for beginners: Sorted Inverted Index in Apache Pinot

A Sorted Inverted Index is a specialized version of the inverted index that maintains both the inverted mapping (value to document IDs) and a sorted order of the values within each index. This index is ideal for scenarios where you want to perform quick lookups as well as efficient range queries or ordering on the indexed column.

The key difference between a standard Bitmap Inverted Index and a Sorted Inverted Index is the ordering of the values. In a Sorted Inverted Index, the document IDs are maintained in a sorted order for each indexed value, which helps to optimize range queries.

How Does a Sorted Inverted Index Work?

Let’s consider the dataset with documents that have a Category and Price columns.

Document Id	Category	Price
d1	Electronics	1000
d2	Furniture	700
d3	Electronics	1200
d4	Clothing	500

When creating a Sorted Inverted Index for the Category and Price columns, Pinot does the following:

· Category Column: Maps each unique category to a list of document IDs that contain that category. These lists are sorted based on document ID or another internal ordering.

· Price Column: Maps each unique price to a list of document IDs that contain that price. These lists are sorted based on price values, enabling efficient range queries.

Example of a Sorted Inverted Index

Let’s see how Pinot would create a Sorted Inverted Index for the Category and Price columns:

Category Column (Sorted Inverted Index):

Category	Document IDs (Sorted)
Electronics	d1, d3
Furniture	d2
Clothing	d4

In the Sorted Inverted Index, the document IDs for Electronics (d1 and d3) are already sorted in ascending order. Similarly, the other categories (Furniture and Clothing) also have their document IDs sorted.

Price Column (Sorted Inverted Index):

Price	Document IDs (Sorted)
500	d4
700	d2
1000	d1
1200	d3

Here, the document IDs are sorted according to the price values. For example, Price = 500 maps to document d4, and Price = 1200 maps to document d3, both stored in a sorted list.

Querying with Sorted Inverted Index

The Sorted Inverted Index improves query performance not only for point queries (looking for exact matches) but also for range queries and ordering operations.

Query 1: Find documents with Category = 'Electronics'

SELECT * FROM Documents WHERE Category = 'Electronics';

How Pinot Retrieves Data?

· Looks up Category = 'Electronics' in the index, which maps to d1, d3.

· Retrieves documents d1 and d3 directly.

Query 2: Find documents with Price BETWEEN 700 AND 1200

SELECT * FROM Documents WHERE Price BETWEEN 700 AND 1200;

How Pinot Retrieves Data?

· Looks up Price values in the index. The list for Price values between 700 and 1200 includes d1, d2, d3.

· Pinot quickly retrieves documents d1, d2, and d3.

· The index's sorted order makes it easy to identify which documents lie within the specified range.

Query 3: Find documents sorted by Price in descending order?

SELECT * FROM Documents ORDER BY Price DESC;

How Pinot Retrieves Data?

The index for Price is already sorted. Pinot can quickly return the documents in descending order by simply reversing the sorted list.

Advantages of Sorted Inverted Index

· Efficient Range Queries: The sorted order of values makes range queries very efficient. You don’t need to scan the entire dataset to find values that fall within a given range.

· Fast Sorting: Queries that require sorting (e.g., ORDER BY clauses) can take advantage of the sorted order in the index, eliminating the need for a full sort operation.

· Lower Disk I/O: By maintaining a sorted list of document IDs for each indexed value, Pinot can avoid performing expensive disk scans when retrieving documents.

When to Use Sorted Inverted Index

· Range Queries: Use the Sorted Inverted Index for columns that are frequently queried with range filters (e.g., BETWEEN, >=, <=).

· Ordering Queries: If you often need to retrieve data sorted by a particular column, the Sorted Inverted Index is perfect for optimizing these queries.

· High Cardinality Columns: For columns with high cardinality (many unique values), such as Price, Timestamp, etc., a Sorted Inverted Index can offer significant performance improvements.

In summary, the Sorted Inverted Index in Apache Pinot is an excellent tool for optimizing queries on columns that require fast lookups, range queries, and sorting operations. By maintaining sorted lists of document IDs for each unique value, it ensures that queries run efficiently without the need for full scans or expensive sorting operations.

Previous Next Home

Programming for beginners

Wednesday, 30 July 2025

Sorted Inverted Index in Apache Pinot

No comments:

Post a Comment