Star-tree index is a hybrid indexing technique in Apache Pinot that optimizes query performance for aggregated queries on large datasets. It pre-aggregates data at various levels and stores it in a tree structure, enabling fast lookups without scanning the entire dataset.
Let me explain it with an example.
1. Consider the following retail_sales table:
order_id |
region |
category |
sub_category |
product |
sales_amount |
101 |
US |
Electronics |
Mobile |
iPhone 15 |
1200 |
102 |
US |
Electronics |
Mobile |
Samsung S23 |
1000 |
103 |
US |
Electronics |
Laptop |
MacBook Pro |
2500 |
104 |
US |
Electronics |
Laptop |
Dell XPS |
1800 |
105 |
EU |
Electronics |
Mobile |
iPhone 14 |
1100 |
106 |
EU |
Electronics |
Laptop |
Lenovo ThinkPad |
1600 |
107 |
EU |
Clothing |
Shoes |
Nike AirMax |
200 |
108 |
EU |
Clothing |
Shoes |
Adidas Ultra |
220 |
109 |
APAC |
Clothing |
T-Shirts |
Uniqlo Tee |
50 |
110 |
APAC |
Electronics |
Mobile |
OnePlus 11 |
900 |
2. Configuring Star-Tree Index with 3 Levels
{ "tableName": "retail_sales", "tableType": "OFFLINE", "tableIndexConfig": { "starTreeIndexConfigs": [ { "dimensionsSplitOrder": ["region", "category", "sub_category"], "skipStarNodeCreationForDimensions": ["product"], "functionColumnPairs": ["SUM(sales_amount)"] } ] } }
Now, aggregations (here I set it as sum of sales_amount) are precomputed at each level (region, category, sub_category).
3. Understanding SUM(sales_amount) at Each Node
The Star-Tree Index generates precomputed sums as follows:
3.1 Root Node (Total Sales): At the top level, Pinot precomputes the total sales amount for the entire dataset:
SUM(sales_amount) = 1200 + 1000 + 2500 + 1800 + 1100 + 1600 + 200 + 220 + 50 + 900 = 10570
3.2 Region-Level Aggregation: At the Region Level, Pinot stores:
· US Sales: 1200 + 1000 + 2500 + 1800 = 6500
· EU Sales: 1100 + 1600 + 200 + 220 = 3120
· APAC Sales: 50 + 900 = 950
3.3 Category-Level Aggregation within Each Region: Within each region, the system further precomputes sums at the category level:
US Region
category |
SUM(sales_amount) |
Electronics |
1200 + 1000 + 2500 + 1800 = 6500 |
EU Region
category |
SUM(sales_amount) |
Electronics |
1100 + 1600 = 2700 |
Clothing |
200 + 220 = 420 |
APAC Region
category |
SUM(sales_amount) |
Electronics |
900 |
Clothing |
50 |
Sub-Category-Level Aggregation within Each Category
At the Sub-Category Level, Pinot further precomputes sums:
3.4 Sub-Category-Level Aggregation within Each Category
At the Sub-Category Level, Pinot further precomputes sums:
US & Electronics
sub_category |
SUM(sales_amount) |
Mobile |
1200 + 1000 = 2200 |
Laptop |
2500 + 1800 = 4300 |
EU & Electronics
sub_category |
SUM(sales_amount) |
Mobile |
1100 |
Laptop |
1600 |
EU& Clothing
sub_category |
SUM(sales_amount) |
Shoes |
200 + 220 = 420 |
APAC & Electronics
sub_category |
SUM(sales_amount) |
Mobile |
900 |
APAC & Clothing
sub_category |
SUM(sales_amount) |
T-Shirts |
50 |
Final tree looks like below.
The index now contains precomputed SUM values at different levels:
Root ├── US (6500) │ ├── Electronics (6500) │ │ ├── Mobile (2200) │ │ ├── Laptop (4300) ├── EU (3120) │ ├── Electronics (2700) │ │ ├── Mobile (1100) │ │ ├── Laptop (1600) │ ├── Clothing (420) │ │ ├── Shoes (420) ├── APAC (950) │ ├── Electronics (900) │ │ ├── Mobile (900) │ ├── Clothing (50) │ │ ├── T-Shirts (50)
Now, if we run a query like below:
SELECT region, category, sub_category, SUM(sales_amount) FROM retail_sales GROUP BY region, category, sub_category;
Instead of scanning the entire dataset, Pinot retrieves precomputed aggregates from the Star-Tree Index, making the query nearly instant.
When to Use Star-Tree Index?
· When queries involve aggregations (SUM, AVG, COUNT, etc.).
· When working with high-cardinality dimensions (e.g., millions of users or products).
· When optimizing for low-latency analytical queries.
It is not useful for raw data lookups or non-aggregated queries.
In summary, Apache Pinot’s Star-Tree Index is a powerful technique for speeding up aggregation queries. By precomputing results and organizing data in a hierarchical tree structure, it significantly reduces query time, making it ideal for real-time analytics on large-scale datasets.
Previous Next Home
No comments:
Post a Comment