Tuesday, 16 September 2025

Exploring Apache Druid Native Queries

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics on large datasets. While SQL is a convenient way to query data in Druid, native queries give you greater flexibility, finer control, and more room for performance tuning.

 

In this blog post, we’ll explore how to use Druid’s scan native query to extract raw event-level data from a sales_data datasource, and then move on to groupBy queries for aggregations. We'll walk through examples using a small sales dataset.

 

Sample Data: sales_data

I ingested the following CSV into Druid as a datasource named sales_data.

timestamp,product,city,total_sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80

Each row represents a product sale with its timestamp, city, and sales amount.
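For reference, below is one way to load this CSV: a native batch ingestion task submitted over HTTP. This is a minimal sketch assuming a local quickstart cluster with the router at localhost:8888; the inline input source and schema details are assumptions that happen to match the data (daily segments, rollup disabled so the duplicate timestamps survive, total_sales stored as a double). Ingesting through the web console works just as well.

import json
import urllib.request

# The CSV from above, header row included (consumed by findColumnsFromHeader).
CSV_DATA = """timestamp,product,city,total_sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80
"""

# Native batch ingestion task: inline CSV input, daily segments, no rollup.
task = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "inline", "data": CSV_DATA},
            "inputFormat": {"type": "csv", "findColumnsFromHeader": True},
        },
        "dataSchema": {
            "dataSource": "sales_data",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {
                "dimensions": [
                    "product",
                    "city",
                    {"type": "double", "name": "total_sales"},
                ]
            },
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
    },
}

# Submit the task; Druid responds with the new task's id.
request = urllib.request.Request(
    "http://localhost:8888/druid/indexer/v1/task",
    data=json.dumps(task).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode())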

 

1. Scan Query in Druid

The scan query in Druid is used to fetch raw data from your datasource. Unlike groupBy or timeseries queries that aggregate data, scan queries return unprocessed rows in a customizable format.

 

Here’s a basic scan query that fetches all data from the sales_data datasource between two timestamps.

 

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "compactedList",
  "columns": ["__time", "product", "city", "total_sales"],
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "batchSize": 35575,
  "limit": 100
}

The following table summarizes the fields in this query payload.

 

Field          Purpose
queryType      The type of query; scan in this case
dataSource     The name of the Druid datasource (your table)
resultFormat   Format of the result rows: compactedList or list (a third format, valueVector, is listed in older docs but not implemented)
columns        The columns to include in the response
intervals      Time interval filter, in ISO 8601 format
batchSize      Number of rows fetched internally per scan batch (defaults to 20480)
limit          Maximum number of rows returned in the result
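Since a native query is just a JSON payload, you can run it by POSTing it to Druid over HTTP. Below is a minimal sketch, assuming a local quickstart cluster with the router at localhost:8888 (you can also POST straight to a Broker's /druid/v2 endpoint); adjust the host and port for your setup.

import json
import urllib.request

# The scan query from above, as a Python dict.
query = {
    "queryType": "scan",
    "dataSource": "sales_data",
    "resultFormat": "compactedList",
    "columns": ["__time", "product", "city", "total_sales"],
    "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
    "batchSize": 35575,
    "limit": 100,
}

# POST the payload to the native query endpoint /druid/v2.
request = urllib.request.Request(
    "http://localhost:8888/druid/v2",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.dumps(json.loads(response.read()), indent=2))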

 

The output JSON looks like this:

[
  {
    "segmentId": "sales_data_2025-04-01T00:00:00.000Z_2025-04-02T00:00:00.000Z_2025-04-18T08:18:17.761Z",
    "columns": [
      "__time",
      "product",
      "city",
      "total_sales"
    ],
    "events": [
      [
        1743501600000,
        "Laptop",
        "Delhi",
        300
      ],
      [
        1743501600000,
        "Laptop",
        "Delhi",
        200
      ],
      [
        1743505200000,
        "Tablet",
        "Mumbai",
        150
      ],
      [
        1743505200000,
        "Tablet",
        "Mumbai",
        50
      ],
      [
        1743508800000,
        "Mobile",
        "Bengaluru",
        200
      ],
      [
        1743512400000,
        "Laptop",
        "Hyderabad",
        250
      ],
      [
        1743516000000,
        "Tablet",
        "Chennai",
        180
      ],
      [
        1743519600000,
        "Mobile",
        "Pune",
        220
      ],
      [
        1743519600000,
        "Mobile",
        "Pune",
        80
      ]
    ],
    "rowSignature": [
      {
        "name": "__time",
        "type": "LONG"
      },
      {
        "name": "product",
        "type": "STRING"
      },
      {
        "name": "city",
        "type": "STRING"
      },
      {
        "name": "total_sales",
        "type": "DOUBLE"
      }
    ]
  }
]
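One thing to notice: __time comes back as epoch milliseconds (hence its LONG type in the rowSignature). A quick check in Python confirms that the first value maps back to the first CSV timestamp:

from datetime import datetime, timezone

# 1743501600000 is the first __time value in the output above.
print(datetime.fromtimestamp(1743501600000 / 1000, tz=timezone.utc))
# 2025-04-01 10:00:00+00:00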

Example 1: Retrieve all rows where sales happened in a specific hour

 

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "compactedList",
  "columns": ["__time", "product", "city", "total_sales"],
  "intervals": ["2025-04-01T11:00:00Z/2025-04-01T12:00:00Z"]
}
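Given the sample data, this returns only the two Tablet sales recorded in Mumbai at 11:00 (150 and 50).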

 

Example 2: Select specific columns only (e.g., product, city, and total_sales)

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "list",
  "columns": ["product", "city", "total_sales"],
  "intervals": ["2025-04-01T11:00:00Z/2025-04-01T12:00:00Z"]
}
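Because __time is left out of columns here, each returned row carries only the three requested fields.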

 

Example 3: Limit the number of records returned.

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "compactedList",
  "columns": ["__time", "product", "total_sales"],
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "limit": 3
}
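Only the first three rows scanned from the interval come back; with the sample data that is typically the two Delhi Laptop sales and the first Mumbai Tablet sale, since rows are returned in segment (time) order unless an order is specified.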

 

Example 4: Using the list resultFormat.

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "list",
  "columns": ["__time", "city", "total_sales"],
  "intervals": ["2025-04-01T13:00:00Z/2025-04-01T16:00:00Z"]
}
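With the list format, each event is a JSON object keyed by column name instead of a positional array, which is easier to consume at the cost of a more verbose response. For this interval the first event would look like {"__time": 1743512400000, "city": "Hyderabad", "total_sales": 250.0}.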

2. Druid GroupBy Native Queries with Filters

While scan queries in Druid return raw data, groupBy queries are used to aggregate data much like SQL GROUP BY. This is useful when you want to answer questions like:

 

·      What is the total sales per product?

·      What is the sales count by city per hour?

·      How much did each city sell after 1 PM?

 

 

Example 1: Total Sales by Product

{
  "queryType": "groupBy",
  "dataSource": "sales_data",
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "granularity": "all",
  "dimensions": ["product"],
  "aggregations": [
    { "type": "longSum", "name": "sales_by_product", "fieldName": "total_sales" }
  ]
}
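Druid returns groupBy results as a list of objects of the form {"version": "v1", "timestamp": "...", "event": {...}}. Given the sample data, the events should show sales_by_product of 750 for Laptop (300 + 200 + 250), 500 for Mobile (200 + 220 + 80), and 380 for Tablet (150 + 50 + 180). Note that longSum coerces the DOUBLE total_sales column to longs; use doubleSum if fractional values matter.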

 


Example 2: Sales per City after 1 PM

{
  "queryType": "groupBy",
  "dataSource": "sales_data",
  "intervals": ["2025-04-01T13:00:00Z/2025-04-01T16:00:00Z"],
  "granularity": "all",
  "dimensions": ["city"],
  "aggregations": [
    { "type": "longSum", "name": "sales_per_city", "fieldName": "total_sales" }
  ]
}
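Given the sample data, this yields Chennai 180, Hyderabad 250, and Pune 300 (220 + 80).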

Example 3: Add a WHERE Clause (Filter by product = 'Laptop')

{
  "queryType": "groupBy",
  "dataSource": "sales_data",
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "product",
    "value": "Laptop"
  },
  "dimensions": ["city"],
  "aggregations": [
    { "type": "longSum", "name": "total_sales", "fieldName": "total_sales" }
  ]
}
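This is roughly the native equivalent of SELECT city, SUM(total_sales) FROM sales_data WHERE product = 'Laptop' GROUP BY city. With the sample data it returns Delhi 500 (300 + 200) and Hyderabad 250.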

 



