Programming for beginners: Optimizing Aggregation Queries with Star-Tree Index in Apache Pinot

Star-tree index is a hybrid indexing technique in Apache Pinot that optimizes query performance for aggregated queries on large datasets. It pre-aggregates data at various levels and stores it in a tree structure, enabling fast lookups without scanning the entire dataset.

Let me explain it with an example.

1. Consider the following retail_sales table:

order_id	region	category	sub_category	product	sales_amount
101	US	Electronics	Mobile	iPhone 15	1200
102	US	Electronics	Mobile	Samsung S23	1000
103	US	Electronics	Laptop	MacBook Pro	2500
104	US	Electronics	Laptop	Dell XPS	1800
105	EU	Electronics	Mobile	iPhone 14	1100
106	EU	Electronics	Laptop	Lenovo ThinkPad	1600
107	EU	Clothing	Shoes	Nike AirMax	200
108	EU	Clothing	Shoes	Adidas Ultra	220
109	APAC	Clothing	T-Shirts	Uniqlo Tee	50
110	APAC	Electronics	Mobile	OnePlus 11	900

2. Configuring Star-Tree Index with 3 Levels

{
  "tableName": "retail_sales",
  "tableType": "OFFLINE",
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["region", "category", "sub_category"],
        "skipStarNodeCreationForDimensions": ["product"],
        "functionColumnPairs": ["SUM(sales_amount)"]
      }
    ]
  }
}

Now, aggregations (here I set it as sum of sales_amount) are precomputed at each level (region, category, sub_category).

3. Understanding SUM(sales_amount) at Each Node

The Star-Tree Index generates precomputed sums as follows:

3.1 Root Node (Total Sales): At the top level, Pinot precomputes the total sales amount for the entire dataset:

SUM(sales_amount) = 1200 + 1000 + 2500 + 1800 + 1100 + 1600 + 200 + 220 + 50 + 900 = 10570

3.2 Region-Level Aggregation: At the Region Level, Pinot stores:

· US Sales: 1200 + 1000 + 2500 + 1800 = 6500

· EU Sales: 1100 + 1600 + 200 + 220 = 3120

· APAC Sales: 50 + 900 = 950

3.3 Category-Level Aggregation within Each Region: Within each region, the system further precomputes sums at the category level:

US Region

category	SUM(sales_amount)
Electronics	1200 + 1000 + 2500 + 1800 = 6500

EU Region

category	SUM(sales_amount)
Electronics	1100 + 1600 = 2700
Clothing	200 + 220 = 420

APAC Region

category	SUM(sales_amount)
Electronics	900
Clothing	50

Sub-Category-Level Aggregation within Each Category

At the Sub-Category Level, Pinot further precomputes sums:

3.4 Sub-Category-Level Aggregation within Each Category

At the Sub-Category Level, Pinot further precomputes sums:

US & Electronics

sub_category	SUM(sales_amount)
Mobile	1200 + 1000 = 2200
Laptop	2500 + 1800 = 4300

EU & Electronics

sub_category	SUM(sales_amount)
Mobile	1100
Laptop	1600

EU& Clothing

sub_category	SUM(sales_amount)
Shoes	200 + 220 = 420

APAC & Electronics

sub_category	SUM(sales_amount)
Mobile	900

APAC & Clothing

sub_category	SUM(sales_amount)
T-Shirts	50

Final tree looks like below.

The index now contains precomputed SUM values at different levels:

Root
 ├── US (6500)
 │    ├── Electronics (6500)
 │    │    ├── Mobile (2200)
 │    │    ├── Laptop (4300)
 ├── EU (3120)
 │    ├── Electronics (2700)
 │    │    ├── Mobile (1100)
 │    │    ├── Laptop (1600)
 │    ├── Clothing (420)
 │    │    ├── Shoes (420)
 ├── APAC (950)
 │    ├── Electronics (900)
 │    │    ├── Mobile (900)
 │    ├── Clothing (50)
 │    │    ├── T-Shirts (50)

Now, if we run a query like below:

SELECT region, category, sub_category, SUM(sales_amount)
FROM retail_sales
GROUP BY region, category, sub_category;

Instead of scanning the entire dataset, Pinot retrieves precomputed aggregates from the Star-Tree Index, making the query nearly instant.

When to Use Star-Tree Index?

· When queries involve aggregations (SUM, AVG, COUNT, etc.).

· When working with high-cardinality dimensions (e.g., millions of users or products).

· When optimizing for low-latency analytical queries.

It is not useful for raw data lookups or non-aggregated queries.

In summary, Apache Pinot’s Star-Tree Index is a powerful technique for speeding up aggregation queries. By precomputing results and organizing data in a hierarchical tree structure, it significantly reduces query time, making it ideal for real-time analytics on large-scale datasets.

Previous Next Home

Programming for beginners

Sunday, 3 August 2025

Optimizing Aggregation Queries with Star-Tree Index in Apache Pinot

No comments:

Post a Comment