Programming for beginners: Exploring Apache Druid Native Queries

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics on large datasets. While SQL is a convenient way to query data in Druid, native queries provide greater flexibility, control, and performance tuning.

In this blog post, we’ll explore how to use Druid’s scan native query to extract raw event-level data from a sales_data datasource. We'll walk through examples using real-world sales data.

Sample Data: sales_data

I ingested the following CSV into Druid with the name sales_data.

timestamp,product,city,total_sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80

Each row represents a product sale at a specific timestamp, city, and quantity.

1. Scan Query in Druid

The scan query in Druid is used to fetch raw data from your datasource. Unlike groupBy or timeseries queries that aggregate data, scan queries return unprocessed rows in a customizable format.

Here’s a basic scan query that fetches all data from the sales_data datasource between two timestamps.

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "compactedList",
  "columns": ["__time", "product", "city", "total_sales"],
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "batchSize": 35575,
  "limit": 100
}

Following table summarizes the native query payload.

Field	Purpose
queryType	Specifies the query type. In this case, scan
dataSource	The name of the Druid datasource (your table)
resultFormat	Format of the result: compactedList, list, or valueVector
columns	List of columns to include in the response
intervals	Time interval filter, ISO 8601 format
batchSize	Number of rows fetched internally per scan
limit	Maximum number of rows returned in the result

Output json looks like below.

Output

[
  {
    "segmentId": "sales_data_2025-04-01T00:00:00.000Z_2025-04-02T00:00:00.000Z_2025-04-18T08:18:17.761Z",
    "columns": [
      "__time",
      "product",
      "city",
      "total_sales"
    ],
    "events": [
      [
        1743501600000,
        "Laptop",
        "Delhi",
        300
      ],
      [
        1743501600000,
        "Laptop",
        "Delhi",
        200
      ],
      [
        1743505200000,
        "Tablet",
        "Mumbai",
        150
      ],
      [
        1743505200000,
        "Tablet",
        "Mumbai",
        50
      ],
      [
        1743508800000,
        "Mobile",
        "Bengaluru",
        200
      ],
      [
        1743512400000,
        "Laptop",
        "Hyderabad",
        250
      ],
      [
        1743516000000,
        "Tablet",
        "Chennai",
        180
      ],
      [
        1743519600000,
        "Mobile",
        "Pune",
        220
      ],
      [
        1743519600000,
        "Mobile",
        "Pune",
        80
      ]
    ],
    "rowSignature": [
      {
        "name": "__time",
        "type": "LONG"
      },
      {
        "name": "product",
        "type": "STRING"
      },
      {
        "name": "city",
        "type": "STRING"
      },
      {
        "name": "total_sales",
        "type": "DOUBLE"
      }
    ]
  }
]

Example 1: Retrieve all rows where sales happened in a specific hour

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "compactedList",
  "columns": ["__time", "product", "city", "total_sales"],
  "intervals": ["2025-04-01T11:00:00Z/2025-04-01T12:00:00Z"]
}

Example 2: Filter specific columns only (e.g., product, city and total_sales)

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "list",
  "columns": ["product", "city", "total_sales"],
  "intervals": ["2025-04-01T11:00:00Z/2025-04-01T12:00:00Z"]
}

Example 3: Limit the number of records returned.

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "compactedList",
  "columns": ["__time", "product", "total_sales"],
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "limit": 3
}

Example 4: Using resultFormat as list.

{
  "queryType": "scan",
  "dataSource": "sales_data",
  "resultFormat": "list",
  "columns": ["__time", "city", "total_sales"],
  "intervals": ["2025-04-01T13:00:00Z/2025-04-01T16:00:00Z"]
}

2. Druid GroupBy Native Queries with Filters

While scan queries in Druid return raw data, groupBy queries are used to aggregate data much like SQL GROUP BY. This is useful when you want to answer questions like:

· What is the total sales per product?

· What is the sales count by city per hour?

· How much did each city sell after 2 PM?

Example 1: Total Sales by Product

{
  "queryType": "groupBy",
  "dataSource": "sales_data",
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "granularity": "all",
  "dimensions": ["product"],
  "aggregations": [
    { "type": "longSum", "name": "sales_by_product", "fieldName": "total_sales" }
  ]
}

Example 2: Sales per City after 1 PM

{
  "queryType": "groupBy",
  "dataSource": "sales_data",
  "intervals": ["2025-04-01T13:00:00Z/2025-04-01T16:00:00Z"],
  "granularity": "all",
  "dimensions": ["city"],
  "aggregations": [
    { "type": "longSum", "name": "sales_per_city", "fieldName": "total_sales" }
  ]
}

Example 3: Add a WHERE Clause (Filter by product = 'Laptop')

{
  "queryType": "groupBy",
  "dataSource": "sales_data",
  "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"],
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "product",
    "value": "Laptop"
  },
  "dimensions": ["city"],
  "aggregations": [
    { "type": "longSum", "name": "total_sales", "fieldName": "total_sales" }
  ]
}

Previous Next Home

Programming for beginners

Tuesday, 16 September 2025

Exploring Apache Druid Native Queries

No comments:

Post a Comment