Friday, 29 August 2025

Transforming Dimensions in Apache Druid: Using Uppercase on City and Product Fields

Apache Druid is a powerful real-time analytics database designed for fast slice-and-dice analytics on large datasets. It supports transforming data during ingestion to help clean, normalize, and enrich it for faster and more effective queries.

In this post, we’ll walk through a simple use case of applying transformations to columns in a CSV file before ingesting it into Druid. Specifically, we’ll demonstrate how to use the upper transformation on the city and product columns to ensure consistent capitalization, an essential step in preparing data for reliable filtering, grouping, and analysis.

 

Whether you're new to Druid or exploring your first ingestion spec, this post will help you get started with transformations.

 

sales_data.csv 

timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80

Let’s transform the product and city columns to uppercase before they are stored in Druid.

 

We’ll define an ingestion spec using the Druid native batch ingestion method and use the transformSpec section to apply the upper function on both product and city.

 

sales_data_transformation.json

 

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "sales_data_transformation_demo",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "product_upper",
          "city_upper"
        ]
      },
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "product_upper",
            "expression": "upper(product)"
          },
          {
            "type": "expression",
            "name": "city_upper",
            "expression": "upper(city)"
          }
        ]
      },
      "metricsSpec": [
        {
          "name": "total_sales",
          "type": "doubleSum",
          "fieldName": "sales"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/Users/Shared/druid_samples",
        "filter": "sales_data.csv"
      },
      "inputFormat": {
        "type": "csv",
        "columns": ["timestamp", "product", "city", "sales"]
      }
    },
    "tuningConfig": {
      "type": "index"
    }
  }
}

Execute following curl command to submit sales_data_transformation.json spec to Druid.  

 

curl -X POST http://localhost:8081/druid/indexer/v1/task \
     -H 'Content-Type: application/json' \
     -d @sales_data_transformation.json

curl -X POST http://localhost:8081/druid/indexer/v1/task \
>      -H 'Content-Type: application/json' \
>      -d @sales_data_transformation.json
{"task":"index_sales_data_transformation_demo_ojacglhm_2025-04-18T13:11:30.578Z"}

Navigate to Druid Workbench -> Query tab, and execute following sql query.

SELECT * FROM "sales_data_transformation_demo"

 

How Transformation is Done in This Spec?

In this ingestion spec, transformations are handled by the transformSpec block. This block is used to define new fields that are derived from existing columns using expression functions during ingestion.


 

 

"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "product_upper",
      "expression": "upper(product)"
    },
    {
      "type": "expression",
      "name": "city_upper",
      "expression": "upper(city)"
    }
  ]
}

What This Does?

Defines two new fields:

·      product_upper

·      city_upper

 

Each field is created by applying the upper() function:

·      product_upper takes the value of the original product column and converts it to uppercase.

·      city_upper does the same for the city column.

 

If the original row looks like this:

2025-04-01T10:00:00Z,Laptop,Delhi,300

 

After applying the transformations:

 

·      product_upper becomes LAPTOP

·      city_upper becomes DELHI

 

These transformed fields are then treated as dimensions (declared in dimensionsSpec) and can be used for filtering, grouping, or displaying in queries.

 

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment