Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics on large datasets. While SQL is a convenient way to query data in Druid, native queries provide greater flexibility, control, and performance tuning.
In this blog post, we’ll explore how to use Druid’s scan native query to extract raw event-level data from a sales_data datasource. We'll walk through examples using real-world sales data.
Sample Data: sales_data
I ingested the following CSV into Druid with the name sales_data.
timestamp,product,city,total_sales 2025-04-01T10:00:00Z,Laptop,Delhi,300 2025-04-01T10:00:00Z,Laptop,Delhi,200 2025-04-01T11:00:00Z,Tablet,Mumbai,150 2025-04-01T11:00:00Z,Tablet,Mumbai,50 2025-04-01T12:00:00Z,Mobile,Bengaluru,200 2025-04-01T13:00:00Z,Laptop,Hyderabad,250 2025-04-01T14:00:00Z,Tablet,Chennai,180 2025-04-01T15:00:00Z,Mobile,Pune,220 2025-04-01T15:00:00Z,Mobile,Pune,80
Each row represents a product sale at a specific timestamp, city, and quantity.
1. Scan Query in Druid
The scan query in Druid is used to fetch raw data from your datasource. Unlike groupBy or timeseries queries that aggregate data, scan queries return unprocessed rows in a customizable format.
Here’s a basic scan query that fetches all data from the sales_data datasource between two timestamps.
{ "queryType": "scan", "dataSource": "sales_data", "resultFormat": "compactedList", "columns": ["__time", "product", "city", "total_sales"], "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"], "batchSize": 35575, "limit": 100 }
Following table summarizes the native query payload.
Field |
Purpose |
queryType |
Specifies the query type. In this case, scan |
dataSource |
The name of the Druid datasource (your table) |
resultFormat |
Format of the result: compactedList, list, or valueVector |
columns |
List of columns to include in the response |
intervals |
Time interval filter, ISO 8601 format |
batchSize |
Number of rows fetched internally per scan |
limit |
Maximum number of rows returned in the result |
Output json looks like below.
Output
[ { "segmentId": "sales_data_2025-04-01T00:00:00.000Z_2025-04-02T00:00:00.000Z_2025-04-18T08:18:17.761Z", "columns": [ "__time", "product", "city", "total_sales" ], "events": [ [ 1743501600000, "Laptop", "Delhi", 300 ], [ 1743501600000, "Laptop", "Delhi", 200 ], [ 1743505200000, "Tablet", "Mumbai", 150 ], [ 1743505200000, "Tablet", "Mumbai", 50 ], [ 1743508800000, "Mobile", "Bengaluru", 200 ], [ 1743512400000, "Laptop", "Hyderabad", 250 ], [ 1743516000000, "Tablet", "Chennai", 180 ], [ 1743519600000, "Mobile", "Pune", 220 ], [ 1743519600000, "Mobile", "Pune", 80 ] ], "rowSignature": [ { "name": "__time", "type": "LONG" }, { "name": "product", "type": "STRING" }, { "name": "city", "type": "STRING" }, { "name": "total_sales", "type": "DOUBLE" } ] } ]
Example 1: Retrieve all rows where sales happened in a specific hour
{ "queryType": "scan", "dataSource": "sales_data", "resultFormat": "compactedList", "columns": ["__time", "product", "city", "total_sales"], "intervals": ["2025-04-01T11:00:00Z/2025-04-01T12:00:00Z"] }
Example 2: Filter specific columns only (e.g., product, city and total_sales)
{ "queryType": "scan", "dataSource": "sales_data", "resultFormat": "list", "columns": ["product", "city", "total_sales"], "intervals": ["2025-04-01T11:00:00Z/2025-04-01T12:00:00Z"] }
Example 3: Limit the number of records returned.
{ "queryType": "scan", "dataSource": "sales_data", "resultFormat": "compactedList", "columns": ["__time", "product", "total_sales"], "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"], "limit": 3 }
Example 4: Using resultFormat as list.
{ "queryType": "scan", "dataSource": "sales_data", "resultFormat": "list", "columns": ["__time", "city", "total_sales"], "intervals": ["2025-04-01T13:00:00Z/2025-04-01T16:00:00Z"] }
2. Druid GroupBy Native Queries with Filters
While scan queries in Druid return raw data, groupBy queries are used to aggregate data much like SQL GROUP BY. This is useful when you want to answer questions like:
· What is the total sales per product?
· What is the sales count by city per hour?
· How much did each city sell after 2 PM?
Example 1: Total Sales by Product
{ "queryType": "groupBy", "dataSource": "sales_data", "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"], "granularity": "all", "dimensions": ["product"], "aggregations": [ { "type": "longSum", "name": "sales_by_product", "fieldName": "total_sales" } ] }
Example 2: Sales per City after 1 PM
{ "queryType": "groupBy", "dataSource": "sales_data", "intervals": ["2025-04-01T13:00:00Z/2025-04-01T16:00:00Z"], "granularity": "all", "dimensions": ["city"], "aggregations": [ { "type": "longSum", "name": "sales_per_city", "fieldName": "total_sales" } ] }
Example 3: Add a WHERE Clause (Filter by product = 'Laptop')
{ "queryType": "groupBy", "dataSource": "sales_data", "intervals": ["2025-04-01T10:00:00Z/2025-04-01T16:00:00Z"], "granularity": "all", "filter": { "type": "selector", "dimension": "product", "value": "Laptop" }, "dimensions": ["city"], "aggregations": [ { "type": "longSum", "name": "total_sales", "fieldName": "total_sales" } ] }
Previous Next Home
No comments:
Post a Comment