Navigate to the Druid console by opening the URL http://localhost:8888 in a browser.
Go to Load data -> Batch - classic.
The Apache Druid installation ships with some sample datasets to start experimenting with. These datasets are available in the apache-druid-32.0.1/quickstart/tutorial/ folder.
$ ls apache-druid-32.0.1/quickstart/tutorial/
compaction-day-granularity.json   transform-data.json          updates-overwrite-index.json
compaction-init-index.json        transform-index.json         wikipedia-index-hadoop3.json
compaction-keep-granularity.json  updates-append-index.json    wikipedia-index.json
deletion-index.json               updates-append-index2.json   wikipedia-kafka-supervisor.json
deletion-kill.json                updates-data.json            wikipedia-top-pages-sql.json
hadoop                            updates-data2.json           wikipedia-top-pages.json
retention-index.json              updates-data3.json           wikiticker-2015-09-12-sampled.json.gz
rollup-data.json                  updates-data4.json
rollup-index.json                 updates-init-index.json
Let’s use the file wikiticker-2015-09-12-sampled.json.gz to start with.
Select ‘Local disk’ and click the ‘Connect data’ button.
Since we are referring to a file inside the Druid installation directory, we can give the relative path quickstart/tutorial as the Base directory. Set the file name wikiticker-2015-09-12-sampled.json.gz in the File filter field and click the ‘Apply’ button.
If everything goes well, you will see the JSON records from the file.
At the bottom right corner of the same page, click the ‘Next: Parse data’ button.
You will be taken to the ‘Parse data’ tab. The Parse data tab is one of the first steps in the data ingestion wizard in the web console (http://localhost:8888). It lets you preview and configure how your raw data is parsed before it is ingested into Druid.
Click the ‘Parse time’ button at the bottom right corner to proceed to the ‘Parse time’ tab.
The Parse time tab is where you configure:
· Which field (column) in your data represents time
· What format that time is in
Druid needs this because it is a time-first database: it requires a special column called __time, which is used to:
· Partition data
· Index efficiently
· Query quickly
· Filter using time ranges (see the query sketch after this list)
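As a quick illustration of the last point, once the ingestion in this walkthrough is finished, a time-range filter on __time looks like the following sketch (it assumes the wikiticker-local-db datasource we create later in this tutorial, and that the channel and page columns are kept at the Configure Schema step):
SELECT __time, channel, page FROM "wikiticker-local-db" WHERE __time >= TIMESTAMP '2015-09-12 00:00:00' AND __time < TIMESTAMP '2015-09-12 01:00:00' LIMIT 5
Because the data is partitioned on __time, Druid can prune whole segments that fall outside this one-hour window instead of scanning everything.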
You can observe that Druid has automatically picked the time column as __time. Let’s keep this as it is and click the ‘Transform’ button available at the bottom right corner of the page. You will be taken to the ‘Transform’ tab.
The Transform tab allows you to create new derived columns from your raw input data before it’s ingested into Druid. These transformations happen:
· Before filtering
· Before ingestion
· Before storage
So, it's a pre-processing stage. Think of it as "cleaning and preparing" your data as it flows into Druid.
Let’s keep it as it is and click the ‘Filter’ button available at the bottom right corner. The Filter tab in Apache Druid's ingestion wizard is where you define conditions to exclude rows of data before they are ingested.
Let’s proceed further by clicking the ‘Configure Schema’ button available at the bottom right corner. The Configure Schema tab in Apache Druid’s web console is where you:
· Define which fields in your data are dimensions
· Pick or create metrics/aggregators
· Configure timestamp handling
· Choose how the data should be partitioned and indexed
This step is all about structuring your data so Druid can ingest, index, and query it efficiently.
Let’s proceed further by clicking on the Partition button.
The Partition tab is all about:
· How your data is split into segments (files stored by Druid)
· How much data goes into each segment
· Whether or not segments are further partitioned internally
· Which time ranges your ingestion covers
This affects query performance, storage layout, and parallelism during ingestion.
Select the Segment granularity as day and click the ‘Tune’ button.
You will be taken to the ‘Tune’ tab. The Tune tab is where you configure how much power, memory, and error tolerance Druid uses while ingesting your data. It answers questions such as:
· How many rows do I buffer before flushing?
· How many errors am I allowed?
· How many tasks should run in parallel?
· How much memory should each task use?
Click the ‘Publish’ button. The Publish tab defines the metadata-level configuration for the ingested data:
· What should the final data source be called?
· Should it overwrite existing data or append to it?
· What visibility should it have (for queries)?
· How should it be stored?
Think of this as the part where Druid wraps up the ingestion process and publishes the results for querying and storage.
Let’s give the datasource name as wikiticker-local-db.
Click the ‘Edit spec’ button at the bottom right corner. You will be taken to the ‘Edit spec’ tab.
The Edit Spec tab allows you to:
· View the entire ingestion spec as raw JSON
· Edit any part of the spec (schema, tuning, transforms, filters, etc.)
· Fine-tune or customize things that may not be exposed via the UI
· Save or reuse the spec for automating future ingestion jobs
This is very helpful for advanced users or anyone who wants to automate ingestion.
Click the Submit task button available at the bottom right corner of the ‘Edit spec’ tab.
Upon successful submission of the task, you will be taken to the Tasks tab.
Give it some time to finish and reload the page; you should see the task status as SUCCESS.
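As a side note, Druid also exposes task metadata through SQL system tables. Once you reach the Query view (introduced in the next step), a sketch like the following lists recent tasks and their statuses; sys.tasks is Druid's built-in task table:
SELECT task_id, datasource, status, duration FROM sys.tasks ORDER BY created_time DESC LIMIT 5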
Click the ‘Query’ button available in the horizontal navigation bar at the top. You will be taken to the Query page.
What Is the "Query" page?
The Query page is where you can:
· Test queries on the ingested data
· Explore your data quickly
· Run different types of Druid queries such as timeseries, groupBy, select, and scan
· View the results of those queries to make sure your ingestion worked as expected
This page is highly useful for interacting with your data without writing code or using external tools.
You can expand the datasource ‘wikiticker-local-db’ to see all the columns in it.
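If you prefer SQL over the tree view, the same column list is available from Druid's INFORMATION_SCHEMA (a quick sketch; note that here the datasource name is a string literal, so it goes in single quotes):
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'wikiticker-local-db'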
Let’s execute the following query (the datasource name must be enclosed in double quotes since it contains hyphens):
SELECT * FROM "wikiticker-local-db"
Upon successful execution of the query, you can see the first page results.
Example 1: Fetching the First 5 Records
You can retrieve a few rows from the datasource with the following query.
SELECT * FROM "wikiticker-local-db" LIMIT 5
This will return the first 5 rows in the wikiticker-local-db table. It’s useful for inspecting the data when you don’t want to overwhelm yourself with too many records.
Example 2: Timeseries Query, Count of Edits Over Time
To get a count of the edits made over time, you can perform a timeseries-style query. This groups the data by timestamp and counts the number of edits for each distinct timestamp in the chosen interval.
SELECT __time, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE __time BETWEEN '2015-09-12T00:00:00.000Z' AND '2015-09-12T23:59:59.999Z' GROUP BY __time
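Grouping on the raw __time column produces one group per distinct timestamp. If you would rather count edits per hour, you can floor the timestamp first; the sketch below uses Druid SQL's TIME_FLOOR function with an ISO 8601 period ('PT1H' is one hour):
SELECT TIME_FLOOR(__time, 'PT1H') AS time_hour, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE __time >= TIMESTAMP '2015-09-12 00:00:00' AND __time < TIMESTAMP '2015-09-13 00:00:00' GROUP BY 1 ORDER BY 1
This returns at most 24 rows, one per hour of the sampled day.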
Example 3: GroupBy Query, Count Edits by Country
Let’s say you want to count how many edits were made from different countries. You can group by the countryIsoCode dimension to see the number of edits from each country.
SELECT countryIsoCode, COUNT(*) AS edit_count FROM "wikiticker-local-db" GROUP BY countryIsoCode ORDER BY edit_count DESC LIMIT 10
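In this sample, geolocation fields are populated only for some edits, so countryIsoCode may be null for many rows and the null group can dominate the result. A small variation on the same query keeps only rows that have a country code:
SELECT countryIsoCode, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE countryIsoCode IS NOT NULL GROUP BY countryIsoCode ORDER BY edit_count DESC LIMIT 10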
Example 4: GroupBy Query, Number of Edits by User
If you want to find out how many edits each user has made, you can group the data by user and count the number of edits per user. Note that user is a reserved keyword in Druid SQL, so the column name must be double-quoted.
SELECT "user", COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE "user" IS NOT NULL GROUP BY "user" ORDER BY edit_count DESC LIMIT 10
This query will return the top 10 users by the number of edits they have made.
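If you also want to see how many distinct pages each of those top users touched, you can add a COUNT(DISTINCT ...) column. This sketch assumes the page dimension was kept at the Configure Schema step; note that COUNT(DISTINCT) is approximate by default in Druid SQL:
SELECT "user", COUNT(*) AS edit_count, COUNT(DISTINCT page) AS pages_edited FROM "wikiticker-local-db" WHERE "user" IS NOT NULL GROUP BY "user" ORDER BY edit_count DESC LIMIT 10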
Example 5: Select Query, Retrieve Specific Fields
You might want to retrieve only specific fields, such as the channel, cityName, and comment fields, to inspect the content of specific edits. Here’s a select query for that:
SELECT channel, cityName, comment FROM "wikiticker-local-db" LIMIT 5
This query will return the first 5 records with only the specified columns (channel, cityName, comment).
Example 6: Filter Query, Fetch Edits by Specific User
If you're interested in fetching all records made by a specific user (e.g., "GELongstreet"), you can filter the data (again double-quoting the user column):
SELECT * FROM "wikiticker-local-db" WHERE "user" = 'GELongstreet' LIMIT 10
This query will return the first 10 records where the user is "GELongstreet". You can adjust the limit to retrieve more or fewer records.
Example 7: TopN Query, Top 5 Cities with the Most Edits
You can use a TopN-style query to get the top 5 cities that had the most edits. TopN queries are useful when you want the highest-ranked values of a dimension based on some metric.
SELECT cityName, COUNT(*) AS edit_count FROM "wikiticker-local-db" GROUP BY cityName ORDER BY edit_count DESC LIMIT 5
This will return the top 5 cities based on the number of edits.
Example 8: Scan Query, Fetch Raw Data Quickly
The scan query type provides a low-latency, unaggregated view of the data. It is typically used for fast access to raw data without any aggregation.
SELECT * FROM "wikiticker-local-db" WHERE cityName = 'New York' LIMIT 5
This will return the first 5 records where the cityName is "New York".
Example 9: Filter by Time Interval, Data for a Specific Day
If you want to get the records for a specific day, say September 12, 2015, you can use the following query.
SELECT * FROM "wikiticker-local-db" WHERE __time BETWEEN '2015-09-12T00:00:00.000Z' AND '2015-09-12T23:59:59.999Z' LIMIT 5
This query will return the first 5 records from September 12, 2015.
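You can also combine a time filter with an aggregation, for example to count edits per channel on that day. This sketch uses explicit TIMESTAMP literals as the time bounds:
SELECT channel, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE __time >= TIMESTAMP '2015-09-12 00:00:00' AND __time < TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY edit_count DESC LIMIT 10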
These queries provide a good starting point to understand the core capabilities of Apache Druid for analyzing and querying your data in real time.
Happy learning….