Friday, 12 September 2025

How to load data from an ORC file into Druid?

What is the ORC file format?

ORC stands for Optimized Row Columnar. It's a highly efficient columnar storage format used primarily in the Hadoop ecosystem, especially with Apache Hive and Apache Spark.

 

Enable ORC format in Apache Druid

ORC support is not enabled by default in Apache Druid. To enable ORC format support and load ORC files into Druid, you need to manually add the ORC extension to the common.runtime.properties file and restart the Druid instance.

 

Step-by-Step: Enable ORC Format in Apache Druid

1. Locate the common.runtime.properties file

If you're using the Druid single-server micro-quickstart, the file path is typically:

 

{DRUID_HOME}/conf/druid/single-server/micro-quickstart/_common/common.runtime.properties

 

If you're using another mode such as nano-quickstart, small, medium, large, or xlarge, locate the appropriate folder under:

{DRUID_HOME}/conf/druid/single-server/

 

Then open the _common/common.runtime.properties file inside that folder.

 

2. Edit the druid.extensions.loadList property

 

Find the following line:

 

druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query", "druid-parquet-extensions"]

 

You need to add "druid-orc-extensions" to the list like this:

 

druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query", "druid-parquet-extensions", "druid-orc-extensions"]

 

3. Restart your Druid instance

After saving the file, restart Druid using:

bin/start-micro-quickstart

 

This ensures Druid loads the ORC extension during startup.
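
If you would rather script the edit from step 2 than change the file by hand, here is a minimal Python sketch. It assumes the loadList value sits on a single line as a JSON-style array, as in the example above; the function name add_extension is made up for this illustration and is not part of Druid.

```python
import json
import re

# Matches the (uncommented) loadList line and captures its JSON-style array.
LOAD_LIST_RE = re.compile(
    r"^(druid\.extensions\.loadList[ \t]*=[ \t]*)(\[.*\])[ \t]*$",
    re.MULTILINE,
)


def add_extension(properties_text, extension="druid-orc-extensions"):
    """Return the properties text with `extension` appended to the loadList."""
    match = LOAD_LIST_RE.search(properties_text)
    if match is None:
        raise ValueError("druid.extensions.loadList not found")
    extensions = json.loads(match.group(2))  # the value parses as a JSON array
    if extension in extensions:
        return properties_text  # already enabled, nothing to change
    extensions.append(extension)
    new_line = match.group(1) + json.dumps(extensions)
    return properties_text[:match.start()] + new_line + properties_text[match.end():]


# Example usage (the path is an assumption, adjust it to your install):
# from pathlib import Path
# path = Path("conf/druid/single-server/micro-quickstart/_common/common.runtime.properties")
# path.write_text(add_extension(path.read_text()))
```

Running it a second time leaves the file unchanged, so it is safe to repeat. Keep a backup of the original file before overwriting it.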

 

Preparing the ORC file

sales_data.csv

timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80

Open https://dataconverter.io/convert/csv-to-orc, convert the CSV file to ORC, and save the result as ‘sales_data.orc’.
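
Before converting, it can help to confirm that the CSV parses cleanly and to note the totals you expect to see in Druid afterwards. The sketch below uses only the Python standard library; the helper name summarize is made up for this illustration, and the embedded text is the sample sales_data.csv from above.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

SAMPLE_CSV = """timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80
"""


def summarize(csv_text):
    """Return (row_count, total sales per product) for the CSV text."""
    totals = defaultdict(int)
    row_count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Raises ValueError on a malformed timestamp; "Z" is rewritten
        # because older Python versions do not accept it in fromisoformat.
        datetime.fromisoformat(row["timestamp"].replace("Z", "+00:00"))
        totals[row["product"]] += int(row["sales"])
        row_count += 1
    return row_count, dict(totals)


row_count, totals = summarize(SAMPLE_CSV)
print(row_count, totals)
# prints: 9 {'Laptop': 750, 'Tablet': 380, 'Mobile': 500}
```

After the ingestion later in this post, the table should contain the same 9 rows, assuming no rollup is applied.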

 

Loading the ORC file

Once Druid is running, open the Druid web console at http://localhost:8888.

 

Go to Load data -> Batch - SQL (multi-stage query).

 


Select ‘Local disk’ as the input type.

 


Set the Base directory to the parent directory of the ORC file.

Enter sales_data.orc as the file name and click the Connect data button.

 

You will be taken to the Load data / Parse page.

 


Click the Next button. You will be taken to the ‘Load data / Configure schema’ tab.

 


Click the SQL tab.

 

Set the table name to ‘sales_data_orc’.

 


Click the ‘Start loading data’ button.
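
For reference, the statement on the SQL tab is a multi-stage query built around Druid's EXTERN table function, which takes an input source and an input format as JSON strings. The Python sketch below assembles a statement of roughly that shape. The function name build_orc_ingest_sql, the base directory /tmp/druid-data, the column types, and the DAY partitioning are all assumptions for illustration; the SQL your console generates may differ, for example depending on how the converter typed the ORC columns.

```python
import json


def build_orc_ingest_sql(base_dir, file_name, table):
    """Build an EXTERN-based ingestion statement for a local ORC file."""
    # Local input source: read files matching `filter` under `baseDir`.
    input_source = json.dumps(
        {"type": "local", "baseDir": base_dir, "filter": file_name}
    ).replace("'", "''")  # escape single quotes for the SQL string literal
    # The "orc" input format is provided by druid-orc-extensions.
    input_format = json.dumps({"type": "orc"})
    return f"""REPLACE INTO "{table}" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{input_source}',
      '{input_format}'
    )
  ) EXTEND ("timestamp" VARCHAR, "product" VARCHAR, "city" VARCHAR, "sales" BIGINT)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "product",
  "city",
  "sales"
FROM "ext"
PARTITIONED BY DAY"""


# Assumed location of sales_data.orc, change it to your own directory.
sql = build_orc_ingest_sql("/tmp/druid-data", "sales_data.orc", "sales_data_orc")
print(sql)
```

Comparing this output with the statement on the SQL tab is a quick way to see which parts the console filled in from your form inputs.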

 

When the ingestion succeeds, the console shows a success message.

 


Navigate to the Query tab by clicking the ‘Query: sales_data_orc’ button.

 

Type the following SQL statement and run the query.

 

SELECT * FROM "sales_data_orc"
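
The same query can also be sent outside the web console through Druid's SQL HTTP API at /druid/v2/sql. The sketch below uses only the Python standard library and assumes Druid is reachable at localhost:8888, the same address as the console; the helper names build_sql_request and run_query are made up for this illustration.

```python
import json
import urllib.request

# Assumption: same host and port as the web console in this post.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_sql_request(query, url=DRUID_SQL_URL):
    """Build a POST request carrying the SQL query as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query):
    """Send the query to Druid and return the decoded JSON response."""
    with urllib.request.urlopen(build_sql_request(query)) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (needs a running Druid instance with the table loaded):
# for row in run_query('SELECT * FROM "sales_data_orc"'):
#     print(row)
```

With the default result format, each row should come back as a JSON object keyed by column name.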

 


That’s it, you are done. Happy learning!

  
