Tuesday, 9 September 2025

Understanding Segments in Apache Druid

When you ingest data into Apache Druid, it's stored in segments. Segments are optimized, compressed data files that enable fast querying and efficient storage. Understanding how segmentation works is essential for designing performant Druid setups.


In this blog, we’ll explore what segments are, how Druid creates them based on your data and configuration, and walk through an example.


Segments in Apache Druid

In Apache Druid, a segment is a chunk of data that covers a specific time interval, like an hour or a day.


·      Segments are the basic unit of storage and query execution.

·      They are compressed, column-oriented, and distributed across Druid nodes for parallel querying.

·      By controlling segment granularity, you can optimize your ingestion and query performance.


Let’s simulate blog visit logs where each entry represents a visit to a blog post at a specific timestamp. We’ll segment the data hourly, so if your data spans 10 different hours, Druid will create 10 separate segments.
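
Under HOUR granularity, each row falls into the hour-aligned interval that contains its timestamp, and each interval becomes its own segment. A segment identifier encodes the datasource name, the interval, and a version; the version is a timestamp assigned at ingestion time, so yours will differ, but the identifiers will look roughly like this:

blog_visits_2025-04-18T00:00:00.000Z_2025-04-18T01:00:00.000Z_<version>
blog_visits_2025-04-18T01:00:00.000Z_2025-04-18T02:00:00.000Z_<version>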


Let’s create a blog_visits.csv file with the following content.


blog_visits.csv 

timestamp,post_id,user_id
2025-04-18T00:10:00Z,101,u1
2025-04-18T01:05:00Z,101,u2
2025-04-18T02:15:00Z,102,u3
2025-04-18T03:30:00Z,103,u4
2025-04-18T04:25:00Z,104,u5
2025-04-18T05:55:00Z,105,u6
2025-04-18T06:45:00Z,106,u7
2025-04-18T07:20:00Z,107,u8
2025-04-18T08:10:00Z,108,u9
2025-04-18T09:40:00Z,109,u10
2025-04-18T10:01:00Z,110,u11

This dataset spans 11 distinct hours (00:00 through 10:00), so hourly segment granularity will generate 11 segments.


Ingestion Spec


blog_visits_spec.json


{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "blog_visits",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "post_id",
          "user_id"
        ]
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/Users/Shared/druid_samples",
        "filter": "blog_visits.csv"
      },
      "inputFormat": {
        "type": "csv",
        "columns": [
          "timestamp",
          "post_id",
          "user_id"
        ],
        "findColumnsFromHeader": false,
        "delimiter": ","
      }
    },
    "tuningConfig": {
      "type": "index",
      "maxRowsInMemory": 10000,
      "maxBytesInMemory": 0,
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": 20000000,
      "forceExtendableShardSpecs": true,
      "logParseExceptions": true,
      "resetOffsetAutomatically": true
    }
  }
}

Execute the following curl command to submit the above spec to Druid.

curl -X POST http://localhost:8081/druid/indexer/v1/task \
     -H 'Content-Type: application/json' \
     -d @blog_visits_spec.json
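
A successful submission returns a JSON response containing the task ID. As a quick check, you can poll the task status through the Overlord API (a sketch assuming the same quickstart host and port; replace <taskId> with the ID from the response):

curl http://localhost:8081/druid/indexer/v1/task/<taskId>/status

The status field reports RUNNING while ingestion is in progress and SUCCESS once the task completes.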

Upon successful completion of the ingestion task, navigate to the Druid console -> Segments tab.


Filter by the blog_visits datasource and you can see all 11 segments, one for each hour of data.
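
You can also list the segments programmatically by running Druid SQL against the sys.segments system table. Below is a minimal sketch using curl; it assumes the quickstart Router serving the SQL endpoint on localhost:8888.

curl -X POST http://localhost:8888/druid/v2/sql \
     -H 'Content-Type: application/json' \
     -d '{"query": "SELECT segment_id, \"start\", \"end\", num_rows FROM sys.segments WHERE datasource = '\''blog_visits'\''"}'

Each returned row corresponds to one hourly segment, with "start" and "end" marking the interval it covers.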



Best Practices for Choosing Segment Granularity

·      Choose segment granularity based on data volume and query patterns.

·      Use hourly for high-volume, high-frequency data like logs or metrics.

·      Use daily or coarser granularity for less frequent data, as in the sketch below.
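
For example, switching this tutorial from hourly to daily segments is a one-line change to the granularitySpec in the ingestion spec above; a minimal sketch:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}

With DAY granularity, all 11 rows above fall on the same day, so Druid would create just one segment covering 2025-04-18/2025-04-19.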
