In this post, you’ll learn how to use the post-index-task command-line tool to ingest a CSV file into Druid running locally.
Sample Data
We’ll be working with a file called sales_data.csv:
timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80
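If you want to follow along, one way to put the sample file in place is with a heredoc. This is just a sketch; it assumes the /Users/Shared/druid_samples directory that the ingestion spec below uses as its baseDir:

# Create the sample directory and write the CSV file
mkdir -p /Users/Shared/druid_samples
cat > /Users/Shared/druid_samples/sales_data.csv <<'EOF'
timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80
EOF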
Ingestion Spec
To load this data into Druid, we need an ingestion spec that tells Druid how to parse the file and where to store it.
Save the following JSON as:
sales_data_via_post_index_ingestion_spec.json
{ "type": "index", "spec": { "dataSchema": { "dataSource": "sales_data_via_post_index", "timestampSpec": { "column": "timestamp", "format": "iso" }, "dimensionsSpec": { "dimensions": ["product", "city"] }, "metricsSpec": [ { "type": "doubleSum", "name": "total_sales", "fieldName": "sales" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "day", "queryGranularity": "none", "rollup": false } }, "ioConfig": { "type": "index", "inputSource": { "type": "local", "baseDir": "/Users/Shared/druid_samples", "filter": "sales_data.csv" }, "inputFormat": { "type": "csv", "findColumnsFromHeader": true } }, "tuningConfig": { "type": "index" } } }
Running the Ingestion
Use the post-index-task tool to post the ingestion spec to Druid:
post-index-task --file sales_data_via_post_index_ingestion_spec.json --url http://localhost:8081
The post-index-task command-line tool ships in the Druid bin directory. A sample run looks like this:
$ post-index-task --file sales_data_via_post_index_ingestion_spec.json --url http://localhost:8081
Beginning indexing data for sales_data_via_post_index
Task started: index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z
Task log: http://localhost:8081/druid/indexer/v1/task/index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z/log
Task status: http://localhost:8081/druid/indexer/v1/task/index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z/status
Task index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z still running...
Task finished with status: SUCCESS
Completed indexing data for sales_data_via_post_index.
Now loading indexed data onto the cluster...
sales_data_via_post_index is 0.0% finished loading...
sales_data_via_post_index loading complete! You may now query your data
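Under the hood, post-index-task submits the spec to Druid's task endpoint, the same host and port you can see in the task log URLs above. If you prefer, a roughly equivalent direct submission with curl is:

# POST the ingestion spec directly to the task endpoint
curl -X POST \
     -H 'Content-Type: application/json' \
     -d @sales_data_via_post_index_ingestion_spec.json \
     http://localhost:8081/druid/indexer/v1/task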
Go to the Druid Query page and execute the following SELECT statement.
SELECT * FROM "sales_data_via_post_index"
You should see the records of the "sales_data_via_post_index" datasource.
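Because rollup is disabled in the spec, each CSV row is stored individually, and you can aggregate at query time instead. For example, using the city dimension and total_sales metric defined above, you could total sales per city:

SELECT city, SUM(total_sales) AS city_sales
FROM "sales_data_via_post_index"
GROUP BY city
ORDER BY city_sales DESC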
With just one command, you’ve ingested your first CSV file into Druid using post-index-task! This method is great for testing and small-scale imports. For production ingestion, you'd typically use Druid's supervisor-based ingestion or connect to cloud object stores.