Thursday, 4 September 2025

Understanding Roll-up in Druid

In Apache Druid, roll-up is a process where incoming data is aggregated at ingestion time, reducing the amount of data stored and speeding up queries, at the cost of losing some data granularity.

Benefits of Roll-up

Reduced Storage: less data is stored because rows are aggregated at ingestion.

Faster Queries: aggregated data means there is less data to scan at query time.

Cost Efficient: less disk usage plus faster queries means money saved.

Drawbacks of Roll-up

Loss of Raw Data: you lose the ability to query individual records.

Less Flexibility: you can't drill down to raw events if needed.

Needs Good Design: choosing the right dimensions and granularity is critical.

Let’s use the following CSV file to understand roll-up behavior.

 

warehouse_items.csv  

item_id,city,quantity,sales,date
001,New York,100,50,2024-01-01T00:00:00Z
002,Los Angeles,150,100,2024-01-01T00:00:00Z
003,New York,200,150,2024-01-01T00:00:00Z
004,San Francisco,300,200,2024-01-02T00:00:00Z
005,New York,100,60,2024-01-02T00:00:00Z
006,Los Angeles,200,120,2024-01-02T00:00:00Z
007,San Francisco,150,130,2024-01-02T00:00:00Z
008,New York,250,200,2024-01-02T00:00:00Z
009,San Francisco,100,80,2024-01-03T00:00:00Z
010,New York,300,250,2024-01-03T00:00:00Z
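Before submitting any ingestion spec, it helps to see what roll-up will do to this data. The following is a minimal plain-Python sketch (not Druid code) under the same assumptions as the specs in this post: timestamps are truncated to the day, and rows sharing the same (day, city) pair are merged using count and longSum aggregations.

```python
# Plain-Python sketch (not Druid code) of what roll-up does to the CSV above:
# truncate each timestamp to its day, then aggregate rows sharing (day, city).
import csv
import io
from collections import defaultdict

CSV_DATA = """item_id,city,quantity,sales,date
001,New York,100,50,2024-01-01T00:00:00Z
002,Los Angeles,150,100,2024-01-01T00:00:00Z
003,New York,200,150,2024-01-01T00:00:00Z
004,San Francisco,300,200,2024-01-02T00:00:00Z
005,New York,100,60,2024-01-02T00:00:00Z
006,Los Angeles,200,120,2024-01-02T00:00:00Z
007,San Francisco,150,130,2024-01-02T00:00:00Z
008,New York,250,200,2024-01-02T00:00:00Z
009,San Francisco,100,80,2024-01-03T00:00:00Z
010,New York,300,250,2024-01-03T00:00:00Z
"""

rolled = defaultdict(lambda: {"record_count": 0, "total_quantity": 0, "total_sales": 0})
for row in csv.DictReader(io.StringIO(CSV_DATA)):
    day = row["date"][:10]                           # day-level truncation
    agg = rolled[(day, row["city"])]
    agg["record_count"] += 1                         # "count" aggregator
    agg["total_quantity"] += int(row["quantity"])    # "longSum" aggregator
    agg["total_sales"] += int(row["sales"])          # "longSum" aggregator

for (day, city), agg in sorted(rolled.items()):
    print(day, city, agg)
```

The ten input rows collapse into seven aggregated rows; for example, the two New York rows on 2024-01-01 become one row with total_quantity 300 and total_sales 200.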

 

warehouse_no_roll_up.json

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "warehouse_items_raw",
      "timestampSpec": {
        "column": "date",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["city"] 
      },
      "metricsSpec": [
        { "name": "total_quantity", "type": "longSum", "fieldName": "quantity" },
        { "name": "total_sales", "type": "longSum", "fieldName": "sales" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/Users/Shared/druid_samples",
        "filter": "warehouse_items.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true,
        "skipHeaderRows": 0
      }
    },
    "tuningConfig": {
      "type": "index"
    }
  }
}

 

Execute the following curl command to submit the above spec.

curl -X POST http://localhost:8081/druid/indexer/v1/task \
     -H 'Content-Type: application/json' \
     -d @warehouse_no_roll_up.json

Go to the Druid console's Query tab and run the following query.

SELECT * FROM "warehouse_items_raw"

You can see that all 10 input rows are returned, one per CSV record, because rollup is false; each longSum metric simply carries that row's own quantity and sales values.

Let’s create a roll-up spec.

 

warehouse_roll_up.json 

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "warehouse_items_rollup",
      "timestampSpec": {
        "column": "date",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["city"]
      },
      "metricsSpec": [
        { "name": "record_count", "type": "count" },
        { "name": "total_quantity", "type": "longSum", "fieldName": "quantity" },
        { "name": "total_sales", "type": "longSum", "fieldName": "sales" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "day",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/Users/Shared/druid_samples",
        "filter": "warehouse_items.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true,
        "skipHeaderRows": 0
      }
    },
    "tuningConfig": {
      "type": "index"
    }
  }
}

Execute the following curl command to submit the above spec.

curl -X POST http://localhost:8081/druid/indexer/v1/task \
     -H 'Content-Type: application/json' \
     -d @warehouse_roll_up.json

 


Query the warehouse_items_rollup datasource and you can observe that the data is aggregated at the city and day level: the 10 input rows collapse into 7 rows, and record_count shows how many raw rows each aggregated row represents.

 

Why is the data aggregated at the day level?

The queryGranularity is set to day, which instructs Druid to truncate every timestamp to the start of its day and then merge rows that share the same truncated timestamp and dimension values.
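To make the truncation concrete, here is a small sketch (plain Python, not Druid internals) of how a day queryGranularity floors a timestamp; rows whose floored timestamp and dimension values match are then merged.

```python
# Illustration (not Druid internals): queryGranularity "day" floors every
# timestamp to the start of its UTC day before rows are aggregated.
from datetime import datetime

def floor_to_day(ts: str) -> str:
    """Truncate an ISO-8601 UTC timestamp to the start of its day."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt.strftime("%Y-%m-%dT00:00:00.000Z")

print(floor_to_day("2024-01-02T13:45:10Z"))  # -> 2024-01-02T00:00:00.000Z
```

With the sample CSV all timestamps are already at midnight, so truncation changes nothing visibly, but any intraday timestamps would be collapsed into the same day bucket the same way.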


 
