When you ingest data into Apache Druid, it's stored in segments. Segments are optimized, compressed data blocks that enable fast querying and efficient storage. Understanding how segmentation works is essential for designing performant Druid setups.
In this blog, we’ll explore what segments are, how Druid creates them based on your data and configuration, and walk through an example.
Segments in Apache Druid
In Apache Druid, a segment is a chunk of data that covers a specific time interval, like an hour or a day.
· Segments are the basic unit of storage and query execution.
· They are compressed, column-oriented, and distributed across Druid nodes for parallel querying.
· By controlling segment granularity, you can optimize your ingestion and query performance.
Let’s simulate blog visit logs where each entry represents a visit to a blog post at a specific timestamp. We’ll segment the data hourly, so if your data spans 10 different hours, Druid will create 10 separate segments.
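Each segment is identified by its datasource, the time interval it covers, and a version timestamp (plus a partition number when an interval holds more than one segment). For the first hour of this example, the identifier would look roughly like the following; the version timestamp here is illustrative:

blog_visits_2025-04-18T00:00:00.000Z_2025-04-18T01:00:00.000Z_2025-04-18T12:34:56.789Z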
Let’s create a blog_visits.csv file.
blog_visits.csv
timestamp,post_id,user_id
2025-04-18T00:10:00Z,101,u1
2025-04-18T01:05:00Z,101,u2
2025-04-18T02:15:00Z,102,u3
2025-04-18T03:30:00Z,103,u4
2025-04-18T04:25:00Z,104,u5
2025-04-18T05:55:00Z,105,u6
2025-04-18T06:45:00Z,106,u7
2025-04-18T07:20:00Z,107,u8
2025-04-18T08:10:00Z,108,u9
2025-04-18T09:40:00Z,109,u10
2025-04-18T10:01:00Z,110,u11
This dataset spans 11 hours, which will generate 11 segments when using hourly segment granularity.
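As a quick sanity check before ingesting, you can count the distinct hours in the file from the shell; this pipeline skips the header row, extracts the hour portion of each timestamp, and should print 11:

# count distinct hour values in the timestamp column (header skipped)
cut -d, -f1 blog_visits.csv | tail -n +2 | cut -c12-13 | sort -u | wc -l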
Ingestion Spec
blog_visits_spec.json
{ "type": "index", "spec": { "dataSchema": { "dataSource": "blog_visits", "timestampSpec": { "column": "timestamp", "format": "iso" }, "dimensionsSpec": { "dimensions": [ "post_id", "user_id" ] }, "metricsSpec": [], "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE", "rollup": false } }, "ioConfig": { "type": "index", "inputSource": { "type": "local", "baseDir": "/Users/Shared/druid_samples", "filter": "blog_visits.csv" }, "inputFormat": { "type": "csv", "columns": [ "timestamp", "post_id", "user_id" ], "findColumnsFromHeader": false, "delimiter": "," } }, "tuningConfig": { "type": "index", "maxRowsInMemory": 10000, "maxBytesInMemory": 0, "maxRowsPerSegment": 5000000, "maxTotalRows": 20000000, "forceExtendableShardSpecs": true, "logParseExceptions": true, "resetOffsetAutomatically": true } } }
Execute the following curl command to submit the above spec.
curl -X POST http://localhost:8081/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @blog_visits_spec.json
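The POST returns a JSON payload containing the new task’s ID. If you prefer the command line to the console, you can poll the task’s status through the Overlord API; replace <task_id> with the ID returned by the submission call:

curl http://localhost:8081/druid/indexer/v1/task/<task_id>/status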
Once the ingestion task completes successfully, navigate to the Druid console -> Segments tab.
Filter by the blog_visits datasource to see all the segments.
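The same information is available programmatically via Druid SQL. Running the following query in the console’s Query tab (or against the /druid/v2/sql endpoint) lists one row per segment, so with the hourly data above it should return 11 rows:

-- list every segment of the blog_visits datasource with its interval and size
SELECT "segment_id", "start", "end", "num_rows", "size"
FROM sys.segments
WHERE "datasource" = 'blog_visits'
ORDER BY "start"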
Best Practices for Choosing Segment Granularity
· Choose segment granularity based on data volume and query patterns.
· Use hourly for high-volume, high-frequency data like logs or metrics.
· Use daily or coarser granularity for less frequent data, as shown in the snippet below.
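To see the trade-off concretely, changing only the granularitySpec in the spec above to daily granularity would pack all 11 rows of this example into a single segment covering 2025-04-18:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}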