Sunday, 3 August 2025

Optimizing Aggregation Queries with Star-Tree Index in Apache Pinot

Star-tree index is a hybrid indexing technique in Apache Pinot that optimizes query performance for aggregated queries on large datasets. It pre-aggregates data at various levels and stores it in a tree structure, enabling fast lookups without scanning the entire dataset. 

Let me explain it with an example.

 

1. Consider the following retail_sales table:

 

order_id

region

category

sub_category

product

sales_amount

101

US

Electronics

Mobile

iPhone 15

1200

102

US

Electronics

Mobile

Samsung S23

1000

103

US

Electronics

Laptop

MacBook Pro

2500

104

US

Electronics

Laptop

Dell XPS

1800

105

EU

Electronics

Mobile

iPhone 14

1100

106

EU

Electronics

Laptop

Lenovo ThinkPad

1600

107

EU

Clothing

Shoes

Nike AirMax

200

108

EU

Clothing

Shoes

Adidas Ultra

220

109

APAC

Clothing

T-Shirts

Uniqlo Tee

50

110

APAC

Electronics

Mobile

OnePlus 11

900

 

2. Configuring Star-Tree Index with 3 Levels

{
  "tableName": "retail_sales",
  "tableType": "OFFLINE",
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["region", "category", "sub_category"],
        "skipStarNodeCreationForDimensions": ["product"],
        "functionColumnPairs": ["SUM(sales_amount)"]
      }
    ]
  }
}

Now, aggregations (here I set it as sum of sales_amount) are precomputed at each level (region, category, sub_category).

 

3. Understanding SUM(sales_amount) at Each Node

The Star-Tree Index generates precomputed sums as follows:

 

3.1 Root Node (Total Sales): At the top level, Pinot precomputes the total sales amount for the entire dataset:

 

SUM(sales_amount) = 1200 + 1000 + 2500 + 1800 + 1100 + 1600 + 200 + 220 + 50 + 900 = 10570

 

3.2 Region-Level Aggregation: At the Region Level, Pinot stores:

 

·      US Sales: 1200 + 1000 + 2500 + 1800 = 6500

·      EU Sales: 1100 + 1600 + 200 + 220 = 3120

·      APAC Sales: 50 + 900 = 950

 

 


3.3 Category-Level Aggregation within Each Region: Within each region, the system further precomputes sums at the category level:

 

US Region

category 

SUM(sales_amount)

Electronics

1200 + 1000 + 2500 + 1800 = 6500

 

EU Region

category

SUM(sales_amount)

Electronics

1100 + 1600 = 2700

Clothing 

200 + 220 = 420

 

APAC Region

category 

SUM(sales_amount)

Electronics

900

Clothing 

50

  


 

Sub-Category-Level Aggregation within Each Category

At the Sub-Category Level, Pinot further precomputes sums:

 

3.4 Sub-Category-Level Aggregation within Each Category

At the Sub-Category Level, Pinot further precomputes sums:

                                            

US & Electronics

sub_category 

SUM(sales_amount)

Mobile 

1200 + 1000 = 2200

Laptop 

2500 + 1800 = 4300

 

EU & Electronics

sub_category 

SUM(sales_amount)

Mobile 

1100

Laptop 

1600

 

 

EU& Clothing

sub_category 

SUM(sales_amount)

Shoes

200 + 220 = 420

 

 

APAC & Electronics

sub_category 

SUM(sales_amount)

Mobile 

900

 

APAC & Clothing

sub_category

SUM(sales_amount)

T-Shirts 

50

 

Final tree looks like below.

 


The index now contains precomputed SUM values at different levels:

Root
 ├── US (6500)
 │    ├── Electronics (6500)
 │    │    ├── Mobile (2200)
 │    │    ├── Laptop (4300)
 ├── EU (3120)
 │    ├── Electronics (2700)
 │    │    ├── Mobile (1100)
 │    │    ├── Laptop (1600)
 │    ├── Clothing (420)
 │    │    ├── Shoes (420)
 ├── APAC (950)
 │    ├── Electronics (900)
 │    │    ├── Mobile (900)
 │    ├── Clothing (50)
 │    │    ├── T-Shirts (50)

Now, if we run a query like below:

SELECT region, category, sub_category, SUM(sales_amount)
FROM retail_sales
GROUP BY region, category, sub_category;

Instead of scanning the entire dataset, Pinot retrieves precomputed aggregates from the Star-Tree Index, making the query nearly instant.

 

When to Use Star-Tree Index?

·      When queries involve aggregations (SUM, AVG, COUNT, etc.).

·      When working with high-cardinality dimensions (e.g., millions of users or products).

·      When optimizing for low-latency analytical queries.

 

It is not useful for raw data lookups or non-aggregated queries.

 

In summary, Apache Pinot’s Star-Tree Index is a powerful technique for speeding up aggregation queries. By precomputing results and organizing data in a hierarchical tree structure, it significantly reduces query time, making it ideal for real-time analytics on large-scale datasets.

 

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment