Star-tree index is a hybrid indexing technique in Apache Pinot that optimizes query performance for aggregated queries on large datasets. It pre-aggregates data at various levels and stores it in a tree structure, enabling fast lookups without scanning the entire dataset.
Let me explain it with an example.
1. Consider the following retail_sales table:
|
order_id |
region |
category |
sub_category |
product |
sales_amount |
|
101 |
US |
Electronics |
Mobile |
iPhone 15 |
1200 |
|
102 |
US |
Electronics |
Mobile |
Samsung S23 |
1000 |
|
103 |
US |
Electronics |
Laptop |
MacBook Pro |
2500 |
|
104 |
US |
Electronics |
Laptop |
Dell XPS |
1800 |
|
105 |
EU |
Electronics |
Mobile |
iPhone 14 |
1100 |
|
106 |
EU |
Electronics |
Laptop |
Lenovo ThinkPad |
1600 |
|
107 |
EU |
Clothing |
Shoes |
Nike AirMax |
200 |
|
108 |
EU |
Clothing |
Shoes |
Adidas Ultra |
220 |
|
109 |
APAC |
Clothing |
T-Shirts |
Uniqlo Tee |
50 |
|
110 |
APAC |
Electronics |
Mobile |
OnePlus 11 |
900 |
2. Configuring Star-Tree Index with 3 Levels
{
"tableName": "retail_sales",
"tableType": "OFFLINE",
"tableIndexConfig": {
"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": ["region", "category", "sub_category"],
"skipStarNodeCreationForDimensions": ["product"],
"functionColumnPairs": ["SUM(sales_amount)"]
}
]
}
}
Now, aggregations (here I set it as sum of sales_amount) are precomputed at each level (region, category, sub_category).
3. Understanding SUM(sales_amount) at Each Node
The Star-Tree Index generates precomputed sums as follows:
3.1 Root Node (Total Sales): At the top level, Pinot precomputes the total sales amount for the entire dataset:
SUM(sales_amount) = 1200 + 1000 + 2500 + 1800 + 1100 + 1600 + 200 + 220 + 50 + 900 = 10570
3.2 Region-Level Aggregation: At the Region Level, Pinot stores:
· US Sales: 1200 + 1000 + 2500 + 1800 = 6500
· EU Sales: 1100 + 1600 + 200 + 220 = 3120
· APAC Sales: 50 + 900 = 950
3.3 Category-Level Aggregation within Each Region: Within each region, the system further precomputes sums at the category level:
US Region
|
category |
SUM(sales_amount) |
|
Electronics |
1200 + 1000 + 2500 + 1800 = 6500 |
EU Region
|
category |
SUM(sales_amount) |
|
Electronics |
1100 + 1600 = 2700 |
|
Clothing |
200 + 220 = 420 |
APAC Region
|
category |
SUM(sales_amount) |
|
Electronics |
900 |
|
Clothing |
50 |
Sub-Category-Level Aggregation within Each Category
At the Sub-Category Level, Pinot further precomputes sums:
3.4 Sub-Category-Level Aggregation within Each Category
At the Sub-Category Level, Pinot further precomputes sums:
US & Electronics
|
sub_category |
SUM(sales_amount) |
|
Mobile |
1200 + 1000 = 2200 |
|
Laptop |
2500 + 1800 = 4300 |
EU & Electronics
|
sub_category |
SUM(sales_amount) |
|
Mobile |
1100 |
|
Laptop |
1600 |
EU& Clothing
|
sub_category |
SUM(sales_amount) |
|
Shoes |
200 + 220 = 420 |
APAC & Electronics
|
sub_category |
SUM(sales_amount) |
|
Mobile |
900 |
APAC & Clothing
|
sub_category |
SUM(sales_amount) |
|
T-Shirts |
50 |
Final tree looks like below.
The index now contains precomputed SUM values at different levels:
Root ├── US (6500) │ ├── Electronics (6500) │ │ ├── Mobile (2200) │ │ ├── Laptop (4300) ├── EU (3120) │ ├── Electronics (2700) │ │ ├── Mobile (1100) │ │ ├── Laptop (1600) │ ├── Clothing (420) │ │ ├── Shoes (420) ├── APAC (950) │ ├── Electronics (900) │ │ ├── Mobile (900) │ ├── Clothing (50) │ │ ├── T-Shirts (50)
Now, if we run a query like below:
SELECT region, category, sub_category, SUM(sales_amount) FROM retail_sales GROUP BY region, category, sub_category;
Instead of scanning the entire dataset, Pinot retrieves precomputed aggregates from the Star-Tree Index, making the query nearly instant.
When to Use Star-Tree Index?
· When queries involve aggregations (SUM, AVG, COUNT, etc.).
· When working with high-cardinality dimensions (e.g., millions of users or products).
· When optimizing for low-latency analytical queries.
It is not useful for raw data lookups or non-aggregated queries.
In summary, Apache Pinot’s Star-Tree Index is a powerful technique for speeding up aggregation queries. By precomputing results and organizing data in a hierarchical tree structure, it significantly reduces query time, making it ideal for real-time analytics on large-scale datasets.
Previous Next Home



No comments:
Post a Comment