Monday, 1 September 2025

Filtering Rows During Ingestion in Apache Druid

When ingesting data into Apache Druid, sometimes you don't want to load everything, you may want to filter out unwanted rows. For example, you might want to ingest only rows where the product is "Mobile", or the city is "Delhi".

 

In this blog post, we’ll learn how to use Druid’s transformSpec.filter to include only the rows we care about. We’ll demonstrate this using a simple CSV file with product sales data.

 

sales_data.csv

timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80

I want to ingest only rows where product = Mobile using a filter.

 

Druid Ingestion Spec with a Filter is given below.

 

sales_data_filters.json

 

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "sales_data_filter_demo",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["product", "city"]
      },
      "transformSpec": {
        "filter": {
          "type": "selector",
          "dimension": "product",
          "value": "Mobile"
        }
      },
      "metricsSpec": [
        {
          "name": "total_sales",
          "type": "doubleSum",
          "fieldName": "sales"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/Users/Shared/druid_samples",
        "filter": "sales_data.csv"
      },
      "inputFormat": {
        "type": "csv",
        "columns": ["timestamp", "product", "city", "sales"]
      }
    },
    "tuningConfig": {
      "type": "index"
    }
  }
}

Execute below command to submit the task from sales_data_filters.json spec file.  

 

curl -X POST http://localhost:8081/druid/indexer/v1/task \
     -H 'Content-Type: application/json' \
     -d @sales_data_filters.json

Navigate to Druid workbench -> Query tab and execute following sql query.  

 

SELECT * FROM "sales_data_filter_demo"

You can see only Mobile specific data.

  

Previous                                                    Next                                                    Home

No comments:

Post a Comment