Navigate to the Druid console by opening the URL http://localhost:8888 in a browser.
Go to Load data -> Batch - classic.
The Apache Druid installation ships with some sample datasets to start experimenting with. These datasets are available in the apache-druid-32.0.1/quickstart/tutorial/ folder.
$ ls apache-druid-32.0.1/quickstart/tutorial/
compaction-day-granularity.json   transform-data.json          updates-overwrite-index.json
compaction-init-index.json        transform-index.json         wikipedia-index-hadoop3.json
compaction-keep-granularity.json  updates-append-index.json    wikipedia-index.json
deletion-index.json               updates-append-index2.json   wikipedia-kafka-supervisor.json
deletion-kill.json                updates-data.json            wikipedia-top-pages-sql.json
hadoop                            updates-data2.json           wikipedia-top-pages.json
retention-index.json              updates-data3.json           wikiticker-2015-09-12-sampled.json.gz
rollup-data.json                  updates-data4.json
rollup-index.json                 updates-init-index.json
Let’s use the file wikiticker-2015-09-12-sampled.json.gz to start with.
Select ‘Local disk’ and click the ‘Connect data’ button.
Since we are referring to a file inside the Druid installation directory, we can give the relative path quickstart/tutorial as the Base directory. Set the file name wikiticker-2015-09-12-sampled.json.gz in the File filter field and click the ‘Apply’ button.
If everything goes well, you will see the JSON records from the file.
At the bottom right corner of the same page, click the ‘Next: Parse data’ button.
You will be taken to the ‘Parse data’ tab. The Parse data tab is one of the first steps in the data ingestion wizard in the web console (http://localhost:8888). It lets you preview and configure how your raw data is parsed before it is ingested into Druid.
Click the ‘Parse time’ button at the bottom right corner to proceed to the ‘Parse time’ tab.
The Parse time tab is where you configure:
· Which field (column) in your data represents time
· What format that time is in
Druid needs this because it is a time-first database: it requires a special column called __time, which is used to:
· Partition data
· Index efficiently
· Query quickly
· Filter using time ranges (see the query sketch after this list)
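As a quick illustration of the last point, once the ingestion in this walkthrough is finished, a time-range filter on __time looks like the following sketch (it assumes the wikiticker-local-db datasource we create later in this tutorial, and that the channel and page columns are kept at the Configure Schema step):
SELECT __time, channel, page FROM "wikiticker-local-db" WHERE __time >= TIMESTAMP '2015-09-12 00:00:00' AND __time < TIMESTAMP '2015-09-12 01:00:00' LIMIT 5
Because the data is partitioned on __time, Druid can prune whole segments that fall outside this one-hour window instead of scanning everything.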
You can observe that Druid has automatically picked the time column as __time. Let’s keep this as it is and click the ‘Transform’ button available at the bottom right corner of the page. You will be taken to the ‘Transform’ tab.
The Transform tab allows you to create new derived columns from your raw input data before it’s ingested into Druid. These transformations happen:
· Before filtering
· Before ingestion
· Before storage
So, it's a pre-processing stage. Think of it as "cleaning and preparing" your data as it flows into Druid.
Let’s keep it as it is and click the ‘Filter’ button available at the bottom right corner. The Filter tab in Apache Druid's ingestion wizard is where you define conditions to exclude rows of data before they are ingested.
Let’s proceed further by clicking the ‘Configure Schema’ button available at the bottom right corner. The Configure Schema tab in Apache Druid’s web console is where you:
· Define which fields in your data are dimensions
· Pick or create metrics/aggregators
· Configure timestamp handling
· Choose how the data should be partitioned and indexed
This step is all about structuring your data so Druid can ingest, index, and query it efficiently.
Let’s proceed further by clicking on the Partition button.
The Partition tab is all about:
· How your data is split into segments (files stored by Druid)
· How much data goes into each segment
· Whether or not segments are further partitioned internally
· Which time ranges your ingestion covers
This affects query performance, storage layout, and parallelism during ingestion.
Select the Segment granularity as day and click the ‘Tune’ button.
You will be taken to the ‘Tune’ tab. The Tune tab is where you configure how much power, memory, and error tolerance Druid uses while ingesting your data. It answers questions such as:
· How many rows do I buffer before flushing?
· How many errors am I allowed?
· How many tasks should run in parallel?
· How much memory should each task use?
Click the ‘Publish’ button. The Publish tab defines the metadata-level configuration for the ingested data:
· What should the final data source be called?
· Should it overwrite existing data or append to it?
· What visibility should it have (for queries)?
· How should it be stored?
Think of this as the part where Druid wraps up the ingestion process and publishes the results for querying and storage.
Let’s give the datasource name as wikiticker-local-db.
Click the ‘Edit spec’ button at the bottom right corner. You will be taken to the ‘Edit spec’ tab.
The Edit Spec tab allows you to:
· View the entire ingestion spec as raw JSON
· Edit any part of the spec (schema, tuning, transforms, filters, etc.)
· Fine-tune or customize things that may not be exposed via the UI
· Save or reuse the spec for automating future ingestion jobs
This is very helpful for advanced users or anyone who wants to automate ingestion.
Click the Submit task button available at the bottom right corner of the ‘Edit spec’ tab.
Upon successful submission of the task, you will be taken to the Tasks tab.
Give it some time to finish and reload the page; you should see the task status as SUCCESS.
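As a side note, Druid also exposes task metadata through SQL system tables. Once you reach the Query view (introduced in the next step), a sketch like the following lists recent tasks and their statuses; sys.tasks is Druid's built-in task table:
SELECT task_id, datasource, status, duration FROM sys.tasks ORDER BY created_time DESC LIMIT 5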
Click the ‘Query’ button available in the horizontal navigation bar at the top. You will be taken to the Query page.
What Is the "Query" page?
The Query page is where you can:
· Test queries on the ingested data
· Explore your data quickly
· Run different types of Druid queries such as timeseries, groupBy, select, and scan
· View the results of those queries to make sure your ingestion worked as expected
This page is highly useful for interacting with your data without writing code or using external tools.
You can expand the datasource ‘wikiticker-local-db’ to see all the columns in it.
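If you prefer SQL over the tree view, the same column list is available from Druid's INFORMATION_SCHEMA (a quick sketch; note that here the datasource name is a string literal, so it goes in single quotes):
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'wikiticker-local-db'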
Let’s execute the following query (the datasource name must be enclosed in double quotes since it contains hyphens):
SELECT * FROM "wikiticker-local-db"
Upon successful execution of the query, you can see the first page results.
Example 1: Fetching the First 5 Records
You can retrieve a few rows from the datasource with the following query.
SELECT * FROM "wikiticker-local-db" LIMIT 5
This will return the first 5 rows in the wikiticker-local-db table. It’s useful for inspecting the data when you don’t want to overwhelm yourself with too many records.
Example 2: Timeseries Query, Count of Edits Over Time
To get a count of the edits made over time, you can perform a timeseries-style query. This groups the data by timestamp and counts the number of edits for each distinct timestamp in the chosen interval.
SELECT __time, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE __time BETWEEN '2015-09-12T00:00:00.000Z' AND '2015-09-12T23:59:59.999Z' GROUP BY __time
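Grouping on the raw __time column produces one group per distinct timestamp. If you would rather count edits per hour, you can floor the timestamp first; the sketch below uses Druid SQL's TIME_FLOOR function with an ISO 8601 period ('PT1H' is one hour):
SELECT TIME_FLOOR(__time, 'PT1H') AS time_hour, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE __time >= TIMESTAMP '2015-09-12 00:00:00' AND __time < TIMESTAMP '2015-09-13 00:00:00' GROUP BY 1 ORDER BY 1
This returns at most 24 rows, one per hour of the sampled day.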
Example 3: GroupBy Query, Count Edits by Country
Let’s say you want to count how many edits were made from different countries. You can group by the countryIsoCode dimension to see the number of edits from each country.
SELECT countryIsoCode, COUNT(*) AS edit_count FROM "wikiticker-local-db" GROUP BY countryIsoCode ORDER BY edit_count DESC LIMIT 10
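In this sample, geolocation fields are populated only for some edits, so countryIsoCode may be null for many rows and the null group can dominate the result. A small variation on the same query keeps only rows that have a country code:
SELECT countryIsoCode, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE countryIsoCode IS NOT NULL GROUP BY countryIsoCode ORDER BY edit_count DESC LIMIT 10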
Example 4: GroupBy Query, Number of Edits by User
If you want to find out how many edits each user has made, you can group the data by user and count the number of edits per user. Note that user is a reserved keyword in Druid SQL, so the column name must be double-quoted.
SELECT "user", COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE "user" IS NOT NULL GROUP BY "user" ORDER BY edit_count DESC LIMIT 10
This query will return the top 10 users by the number of edits they have made.
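If you also want to see how many distinct pages each of those top users touched, you can add a COUNT(DISTINCT ...) column. This sketch assumes the page dimension was kept at the Configure Schema step; note that COUNT(DISTINCT) is approximate by default in Druid SQL:
SELECT "user", COUNT(*) AS edit_count, COUNT(DISTINCT page) AS pages_edited FROM "wikiticker-local-db" WHERE "user" IS NOT NULL GROUP BY "user" ORDER BY edit_count DESC LIMIT 10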
Example 5: Select Query, Retrieve Specific Fields
You might want to retrieve only specific fields, such as the channel, cityName, and comment fields, to inspect the content of specific edits. Here’s a select query for that:
SELECT channel, cityName, comment FROM "wikiticker-local-db" LIMIT 5
This query will return the first 5 records with only the specified columns (channel, cityName, comment).
Example 6: Filter Query, Fetch Edits by Specific User
If you're interested in fetching all records made by a specific user (e.g., "GELongstreet"), you can filter the data (again double-quoting the user column):
SELECT * FROM "wikiticker-local-db" WHERE "user" = 'GELongstreet' LIMIT 10
This query will return the first 10 records where the user is "GELongstreet". You can adjust the limit to retrieve more or fewer records.
Example 7: TopN Query, Top 5 Cities with the Most Edits
You can use a TopN-style query to get the top 5 cities that had the most edits. TopN queries are useful when you want the highest-ranked values of a dimension based on some metric.
SELECT cityName, COUNT(*) AS edit_count FROM "wikiticker-local-db" GROUP BY cityName ORDER BY edit_count DESC LIMIT 5
This will return the top 5 cities based on the number of edits.
Example 8: Scan Query, Fetch Raw Data Quickly
The scan query type provides a low-latency, unaggregated view of the data. It is typically used for fast access to raw data without any aggregation.
SELECT * FROM "wikiticker-local-db" WHERE cityName = 'New York' LIMIT 5
This will return the first 5 records where the cityName is "New York".
Example 9: Filter by Time Interval, Data for a Specific Day
If you want to get the records for a specific day, say September 12, 2015, you can use the following query.
SELECT * FROM "wikiticker-local-db" WHERE __time BETWEEN '2015-09-12T00:00:00.000Z' AND '2015-09-12T23:59:59.999Z' LIMIT 5
This query will return the first 5 records from September 12, 2015.
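You can also combine a time filter with an aggregation, for example to count edits per channel on that day. This sketch uses explicit TIMESTAMP literals as the time bounds:
SELECT channel, COUNT(*) AS edit_count FROM "wikiticker-local-db" WHERE __time >= TIMESTAMP '2015-09-12 00:00:00' AND __time < TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY edit_count DESC LIMIT 10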
These queries provide a good starting point to understand the core capabilities of Apache Druid for analyzing and querying your data in real time.
Happy learning….