Thursday, 28 August 2025

Ingesting Files Using Druid's post-index-task Command

In this post, you’ll learn how to use the post-index-task command-line tool to ingest a CSV file into Druid running locally.

Sample Data

We’ll be working with a file called sales_data.csv, saved under /Users/Shared/druid_samples (the baseDir referenced in the ingestion spec below):

timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80

Ingestion Spec

To load this data into Druid, we need an ingestion spec that tells Druid how to parse the file and where to store it.


Save the following JSON as:

sales_data_via_post_index_ingestion_spec.json

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "sales_data_via_post_index",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["product", "city"]
      },
      "metricsSpec": [
        {
          "type": "doubleSum",
          "name": "total_sales",
          "fieldName": "sales"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/Users/Shared/druid_samples",
        "filter": "sales_data.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "tuningConfig": {
      "type": "index"
    }
  }
}
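
Note that this spec sets "rollup": false, so each of the nine CSV rows is stored as its own Druid row and total_sales simply carries the per-row sales value. The sample data contains duplicate timestamp/product/city combinations (for example, the two Delhi Laptop rows at 10:00). If you wanted Druid to pre-aggregate those at ingestion time, a variant granularitySpec like the following should work (a sketch; only queryGranularity and rollup change):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true
}

With rollup enabled, the two Delhi Laptop rows would be stored as a single row with total_sales = 500.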


Running the Ingestion

Use the post-index-task command-line tool, which ships in the Druid bin directory, to post the ingestion spec to Druid:

$ post-index-task --file sales_data_via_post_index_ingestion_spec.json --url http://localhost:8081
Beginning indexing data for sales_data_via_post_index
Task started: index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z
Task log:     http://localhost:8081/druid/indexer/v1/task/index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z/log
Task status:  http://localhost:8081/druid/indexer/v1/task/index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z/status
Task index_sales_data_via_post_index_ccbpifce_2025-04-18T08:44:43.646Z still running...
Task finished with status: SUCCESS
Completed indexing data for sales_data_via_post_index. Now loading indexed data onto the cluster...
sales_data_via_post_index is 0.0% finished loading...
sales_data_via_post_index loading complete! You may now query your data
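
As the task log and status URLs above suggest, post-index-task is a convenience wrapper around Druid's task API, so you can also submit the spec directly. A minimal sketch with curl, assuming Druid is reachable at localhost:8081 as above:

# Submit the ingestion spec to the task endpoint
curl -X POST -H 'Content-Type: application/json' \
  -d @sales_data_via_post_index_ingestion_spec.json \
  http://localhost:8081/druid/indexer/v1/task

# Check progress using the taskId returned by the call above
curl http://localhost:8081/druid/indexer/v1/task/<taskId>/status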


Go to the Druid web console's Query page and execute the following SELECT statement.

SELECT * FROM "sales_data_via_post_index"


You should see all nine records of the "sales_data_via_post_index" datasource.
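
Beyond inspecting raw rows, you can aggregate the total_sales metric with ordinary Druid SQL. For example, to total sales per product:

SELECT product, SUM("total_sales") AS product_sales
FROM "sales_data_via_post_index"
GROUP BY product

This should return one row per product, e.g. 750 for Laptop (300 + 200 + 250).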



With just one command, you’ve ingested your first CSV file into Druid using post-index-task! This method is great for testing and small-scale imports. For production ingestion, you'd typically use Druid's supervisor-based ingestion or connect to cloud object stores.
