In Apache Druid, roll-up is a process where incoming data is aggregated at ingestion time, reducing the amount of data stored and speeding up queries, at the cost of losing some data granularity.
Benefits of Roll-up
Benefit         | Description
----------------|------------------------------------------
Reduced Storage | Less data stored due to aggregation
Faster Queries  | Aggregated data = less to scan
Cost Efficient  | Less disk + faster queries = $$$ saved
Drawbacks of Roll-up
Drawback          | Description
------------------|--------------------------------------------------------
Loss of Raw Data  | You lose the ability to query individual records
Less Flexibility  | Can't drill down to raw events if needed
Needs Good Design | Choosing proper dimensions and granularity is critical
Let’s take the following CSV file to understand roll-up behavior.
warehouse_items.csv
item_id,city,quantity,sales,date
001,New York,100,50,2024-01-01T00:00:00Z
002,Los Angeles,150,100,2024-01-01T00:00:00Z
003,New York,200,150,2024-01-01T00:00:00Z
004,San Francisco,300,200,2024-01-02T00:00:00Z
005,New York,100,60,2024-01-02T00:00:00Z
006,Los Angeles,200,120,2024-01-02T00:00:00Z
007,San Francisco,150,130,2024-01-02T00:00:00Z
008,New York,250,200,2024-01-02T00:00:00Z
009,San Francisco,100,80,2024-01-03T00:00:00Z
010,New York,300,250,2024-01-03T00:00:00Z
warehouse_no_roll_up.json
{ "type": "index", "spec": { "dataSchema": { "dataSource": "warehouse_items_raw", "timestampSpec": { "column": "date", "format": "iso" }, "dimensionsSpec": { "dimensions": ["city"] }, "metricsSpec": [ { "name": "total_quantity", "type": "longSum", "fieldName": "quantity" }, { "name": "total_sales", "type": "longSum", "fieldName": "sales" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "day", "queryGranularity": "none", "rollup": false } }, "ioConfig": { "type": "index", "inputSource": { "type": "local", "baseDir": "/Users/Shared/druid_samples", "filter": "warehouse_items.csv" }, "inputFormat": { "type": "csv", "findColumnsFromHeader": true, "skipHeaderRecords": 0 } }, "tuningConfig": { "type": "index" } } }
In this spec, rollup is set to false and queryGranularity to none, so every input row is stored individually. Execute the following curl command to submit the spec.
curl -X POST http://localhost:8081/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @warehouse_no_roll_up.json
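If the submission succeeds, the API responds with the ID of the newly created ingestion task (the exact ID is generated per run), along the lines of:

{"task":"index_warehouse_items_raw_..."}

You can track the task until it succeeds in the Ingestion view of the Druid console.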
Once the task succeeds, go to the Druid Query tab and execute the following query.
SELECT * FROM "warehouse_items_raw"
Since roll-up is disabled, you should see all ten input rows, stored individually.
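A sketch of the expected result, derived from the input CSV (row order may differ; item_id is not stored because only city was declared as a dimension, and the raw quantity and sales columns are ingested as the total_quantity and total_sales metric columns):

__time                   | city          | total_quantity | total_sales
-------------------------|---------------|----------------|------------
2024-01-01T00:00:00.000Z | New York      | 100            | 50
2024-01-01T00:00:00.000Z | Los Angeles   | 150            | 100
2024-01-01T00:00:00.000Z | New York      | 200            | 150
2024-01-02T00:00:00.000Z | San Francisco | 300            | 200
2024-01-02T00:00:00.000Z | New York      | 100            | 60
2024-01-02T00:00:00.000Z | Los Angeles   | 200            | 120
2024-01-02T00:00:00.000Z | San Francisco | 150            | 130
2024-01-02T00:00:00.000Z | New York      | 250            | 200
2024-01-03T00:00:00.000Z | San Francisco | 100            | 80
2024-01-03T00:00:00.000Z | New York      | 300            | 250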
Let’s now create a roll-up spec.
warehouse_roll_up.json
{ "type": "index", "spec": { "dataSchema": { "dataSource": "warehouse_items_rollup", "timestampSpec": { "column": "date", "format": "iso" }, "dimensionsSpec": { "dimensions": ["city"] }, "metricsSpec": [ { "name": "record_count", "type": "count" }, { "name": "total_quantity", "type": "longSum", "fieldName": "quantity" }, { "name": "total_sales", "type": "longSum", "fieldName": "sales" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "day", "queryGranularity": "day", "rollup": true } }, "ioConfig": { "type": "index", "inputSource": { "type": "local", "baseDir": "/Users/Shared/druid_samples", "filter": "warehouse_items.csv" }, "inputFormat": { "type": "csv", "findColumnsFromHeader": true, "skipHeaderRecords": 0 } }, "tuningConfig": { "type": "index" } } }
Compared to the previous spec, rollup is now true, queryGranularity is day, and a count metric (record_count) tracks how many raw rows were combined into each stored row. Execute the following curl command to submit it.
curl -X POST http://localhost:8081/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @warehouse_roll_up.json
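Once the task completes, go back to the Query tab and run:

SELECT * FROM "warehouse_items_rollup"

Working from the input CSV, the ten raw rows should collapse into seven stored rows, one per unique (day, city) combination (row order may differ):

__time                   | city          | record_count | total_quantity | total_sales
-------------------------|---------------|--------------|----------------|------------
2024-01-01T00:00:00.000Z | Los Angeles   | 1            | 150            | 100
2024-01-01T00:00:00.000Z | New York      | 2            | 300            | 200
2024-01-02T00:00:00.000Z | Los Angeles   | 1            | 200            | 120
2024-01-02T00:00:00.000Z | New York      | 2            | 350            | 260
2024-01-02T00:00:00.000Z | San Francisco | 2            | 450            | 330
2024-01-03T00:00:00.000Z | New York      | 1            | 300            | 250
2024-01-03T00:00:00.000Z | San Francisco | 1            | 100            | 80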
You can observe that the data is aggregated at the city and day level.
Why is the data aggregated at the day level?
The queryGranularity is set to day, which instructs Druid to truncate every timestamp to the start of its day; rows that then share the same truncated timestamp and the same dimension values (here, city) are aggregated into a single stored row.
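As a quick sanity check (assuming both ingestions above have completed), the record_count metric preserves the original row count, so you can compare the number of stored rows against the number of raw rows the roll-up represents:

-- Number of rows physically stored after roll-up (7 for this dataset)
SELECT COUNT(*) AS stored_rows FROM "warehouse_items_rollup";

-- Number of original input rows they represent (10 for this dataset)
SELECT SUM("record_count") AS raw_rows FROM "warehouse_items_rollup";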