In Apache Druid, roll-up is a process where incoming data is aggregated at ingestion time, reducing the amount of data stored and speeding up queries, at the cost of losing some data granularity.
Benefits of Roll-up
Benefit         | Description
----------------|------------------------------------------
Reduced Storage | Less data stored due to aggregation
Faster Queries  | Aggregated data = less to scan
Cost Efficient  | Less disk + faster queries = $$$ saved
Drawbacks of Roll-up
Drawback          | Description
------------------|--------------------------------------------------------
Loss of Raw Data  | You lose the ability to query individual records
Less Flexibility  | Can't drill down to raw events if needed
Needs Good Design | Choosing proper dimensions and granularity is critical
Let’s take the following CSV file to understand roll-up behavior.
warehouse_items.csv
item_id,city,quantity,sales,date
001,New York,100,50,2024-01-01T00:00:00Z
002,Los Angeles,150,100,2024-01-01T00:00:00Z
003,New York,200,150,2024-01-01T00:00:00Z
004,San Francisco,300,200,2024-01-02T00:00:00Z
005,New York,100,60,2024-01-02T00:00:00Z
006,Los Angeles,200,120,2024-01-02T00:00:00Z
007,San Francisco,150,130,2024-01-02T00:00:00Z
008,New York,250,200,2024-01-02T00:00:00Z
009,San Francisco,100,80,2024-01-03T00:00:00Z
010,New York,300,250,2024-01-03T00:00:00Z
warehouse_no_roll_up.json
{ "type": "index", "spec": { "dataSchema": { "dataSource": "warehouse_items_raw", "timestampSpec": { "column": "date", "format": "iso" }, "dimensionsSpec": { "dimensions": ["city"] }, "metricsSpec": [ { "name": "total_quantity", "type": "longSum", "fieldName": "quantity" }, { "name": "total_sales", "type": "longSum", "fieldName": "sales" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "day", "queryGranularity": "none", "rollup": false } }, "ioConfig": { "type": "index", "inputSource": { "type": "local", "baseDir": "/Users/Shared/druid_samples", "filter": "warehouse_items.csv" }, "inputFormat": { "type": "csv", "findColumnsFromHeader": true, "skipHeaderRecords": 0 } }, "tuningConfig": { "type": "index" } } }
In this spec, rollup is set to false and queryGranularity to none, so every input row is stored individually. Execute the following curl command to submit the spec.
curl -X POST http://localhost:8081/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @warehouse_no_roll_up.json
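If the submission succeeds, the API responds with the ID of the newly created ingestion task (the exact ID is generated per run), along the lines of:

{"task":"index_warehouse_items_raw_..."}

You can track the task until it succeeds in the Ingestion view of the Druid console.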
Once the task succeeds, go to the Druid Query tab and execute the following query.
SELECT * FROM "warehouse_items_raw"
Since roll-up is disabled, you should see all ten input rows, stored individually.
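A sketch of the expected result, derived from the input CSV (row order may differ; item_id is not stored because only city was declared as a dimension, and the raw quantity and sales columns are ingested as the total_quantity and total_sales metric columns):

__time                   | city          | total_quantity | total_sales
-------------------------|---------------|----------------|------------
2024-01-01T00:00:00.000Z | New York      | 100            | 50
2024-01-01T00:00:00.000Z | Los Angeles   | 150            | 100
2024-01-01T00:00:00.000Z | New York      | 200            | 150
2024-01-02T00:00:00.000Z | San Francisco | 300            | 200
2024-01-02T00:00:00.000Z | New York      | 100            | 60
2024-01-02T00:00:00.000Z | Los Angeles   | 200            | 120
2024-01-02T00:00:00.000Z | San Francisco | 150            | 130
2024-01-02T00:00:00.000Z | New York      | 250            | 200
2024-01-03T00:00:00.000Z | San Francisco | 100            | 80
2024-01-03T00:00:00.000Z | New York      | 300            | 250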
Let’s now create a roll-up spec.
warehouse_roll_up.json
{ "type": "index", "spec": { "dataSchema": { "dataSource": "warehouse_items_rollup", "timestampSpec": { "column": "date", "format": "iso" }, "dimensionsSpec": { "dimensions": ["city"] }, "metricsSpec": [ { "name": "record_count", "type": "count" }, { "name": "total_quantity", "type": "longSum", "fieldName": "quantity" }, { "name": "total_sales", "type": "longSum", "fieldName": "sales" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "day", "queryGranularity": "day", "rollup": true } }, "ioConfig": { "type": "index", "inputSource": { "type": "local", "baseDir": "/Users/Shared/druid_samples", "filter": "warehouse_items.csv" }, "inputFormat": { "type": "csv", "findColumnsFromHeader": true, "skipHeaderRecords": 0 } }, "tuningConfig": { "type": "index" } } }
Compared to the previous spec, rollup is now true, queryGranularity is day, and a count metric (record_count) tracks how many raw rows were combined into each stored row. Execute the following curl command to submit it.
curl -X POST http://localhost:8081/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @warehouse_roll_up.json
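Once the task completes, go back to the Query tab and run:

SELECT * FROM "warehouse_items_rollup"

Working from the input CSV, the ten raw rows should collapse into seven stored rows, one per unique (day, city) combination (row order may differ):

__time                   | city          | record_count | total_quantity | total_sales
-------------------------|---------------|--------------|----------------|------------
2024-01-01T00:00:00.000Z | Los Angeles   | 1            | 150            | 100
2024-01-01T00:00:00.000Z | New York      | 2            | 300            | 200
2024-01-02T00:00:00.000Z | Los Angeles   | 1            | 200            | 120
2024-01-02T00:00:00.000Z | New York      | 2            | 350            | 260
2024-01-02T00:00:00.000Z | San Francisco | 2            | 450            | 330
2024-01-03T00:00:00.000Z | New York      | 1            | 300            | 250
2024-01-03T00:00:00.000Z | San Francisco | 1            | 100            | 80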
You can observe that the data is aggregated at the city and day level.
Why is the data aggregated at the day level?
The queryGranularity is set to day, which instructs Druid to truncate every timestamp to the start of its day; rows that then share the same truncated timestamp and the same dimension values (here, city) are aggregated into a single stored row.
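As a quick sanity check (assuming both ingestions above have completed), the record_count metric preserves the original row count, so you can compare the number of stored rows against the number of raw rows the roll-up represents:

-- Number of rows physically stored after roll-up (7 for this dataset)
SELECT COUNT(*) AS stored_rows FROM "warehouse_items_rollup";

-- Number of original input rows they represent (10 for this dataset)
SELECT SUM("record_count") AS raw_rows FROM "warehouse_items_rollup";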