What is the ORC file format?
ORC stands for Optimized Row Columnar. It's a highly efficient columnar storage format used primarily in the Hadoop ecosystem, especially with Apache Hive and Apache Spark.
Enable ORC format in Apache Druid
ORC support is not enabled by default in Apache Druid. To enable ORC format support and load ORC files into Druid, you need to manually add the ORC extension to the common.runtime.properties file and restart the Druid instance.
Step-by-Step: Enable ORC Format in Apache Druid
1. Locate the common.runtime.properties file
If you're using the Druid single-server micro-quickstart, the file path is typically:
{DRUID_HOME}/conf/druid/single-server/micro-quickstart/_common/common.runtime.properties
If you're using another profile such as nano-quickstart, small, medium, large, or xlarge, locate the corresponding folder under:
{DRUID_HOME}/conf/druid/single-server/
Then open the _common/common.runtime.properties file inside that folder.
2. Edit the druid.extensions.loadList property
Find the following line:
druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query", "druid-parquet-extensions"]
You need to add "druid-orc-extensions" to the list like this:
druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query", "druid-parquet-extensions", "druid-orc-extensions"]
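If you prefer to script this edit instead of doing it by hand, the following is a minimal Python sketch. The helper names add_extension and patch_file are hypothetical, and the sketch assumes the loadList property sits on a single, uncommented line as shown above.

```python
import json
import re
from pathlib import Path

EXTENSION = "druid-orc-extensions"

# Matches an uncommented, single-line loadList property (assumption).
LOAD_LIST = re.compile(
    r"^([ \t]*druid\.extensions\.loadList[ \t]*=[ \t]*)(\[.*\])[ \t]*$",
    re.MULTILINE,
)


def add_extension(properties_text, extension=EXTENSION):
    """Return the properties text with `extension` appended to loadList."""
    match = LOAD_LIST.search(properties_text)
    if match is None:
        raise ValueError("druid.extensions.loadList not found")
    extensions = json.loads(match.group(2))  # the value is a JSON array
    if extension in extensions:
        return properties_text  # already enabled, nothing to change
    extensions.append(extension)
    new_line = match.group(1) + json.dumps(extensions)
    return properties_text[:match.start()] + new_line + properties_text[match.end():]


def patch_file(path):
    """Rewrite the given common.runtime.properties file in place."""
    file = Path(path)
    file.write_text(add_extension(file.read_text()))
```

Calling patch_file with the path from step 1 updates the file. Running it a second time leaves the file unchanged, because the extension is only added when it is missing.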
3. Restart your Druid instance
After saving the file, stop Druid if it is still running, then start it again from {DRUID_HOME} using:
bin/start-micro-quickstart
This ensures Druid loads the ORC extension during startup.
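One way to confirm that the extension was picked up is to ask the running instance for its status. The sketch below is an assumption-laden example: it assumes the router listens on http://localhost:8888 and that the /status response contains a "modules" list whose entries carry an "artifact" field. The function names are hypothetical.

```python
import json
import urllib.request


def loaded_artifacts(status):
    """Return the artifact names found in a parsed /status response.

    Assumes the response has a "modules" list with "artifact" entries.
    """
    return {module.get("artifact") for module in status.get("modules", [])}


def orc_extension_loaded(base_url="http://localhost:8888"):
    """Fetch /status from a running Druid and look for the ORC extension."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/status") as resp:
        status = json.loads(resp.read().decode("utf-8"))
    return "druid-orc-extensions" in loaded_artifacts(status)
```

If orc_extension_loaded() returns False after a restart, re-check that the edited file belongs to the profile you actually started.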
Preparing the ORC file
sales_data.csv
timestamp,product,city,sales
2025-04-01T10:00:00Z,Laptop,Delhi,300
2025-04-01T10:00:00Z,Laptop,Delhi,200
2025-04-01T11:00:00Z,Tablet,Mumbai,150
2025-04-01T11:00:00Z,Tablet,Mumbai,50
2025-04-01T12:00:00Z,Mobile,Bengaluru,200
2025-04-01T13:00:00Z,Laptop,Hyderabad,250
2025-04-01T14:00:00Z,Tablet,Chennai,180
2025-04-01T15:00:00Z,Mobile,Pune,220
2025-04-01T15:00:00Z,Mobile,Pune,80
Go to https://dataconverter.io/convert/csv-to-orc, convert the CSV file to ORC, and save the result as ‘sales_data.orc’.
Loading the ORC file
Once Druid is running, open the Druid web console at http://localhost:8888.
Go to Load data -> Batch-SQL multi-stage-query.
Select ‘Local disk’ as the input type.
Set the base directory to the directory that contains the ORC file.
Enter sales_data.orc as the file name and click the ‘Connect data’ button.
You will be taken to the ‘Load data / Parse’ page.
Click the Next button. You will be taken to the ‘Load data / Configure schema’ tab.
Click the SQL tab.
Set the table name to ‘sales_data_orc’.
Click the ‘Start loading data’ button.
Once the ingestion succeeds, the console shows a success message.
Open the Query tab by clicking the ‘Query: sales_data_orc’ button.
Type the following SQL statement and run the query.
SELECT * FROM "sales_data_orc"
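The same query can also be sent outside the web console through Druid's SQL HTTP endpoint (/druid/v2/sql). The following is a minimal sketch using only the Python standard library. It assumes the router from the quickstart is reachable at http://localhost:8888 without authentication, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_sql_request(query, base_url="http://localhost:8888"):
    """Build a POST request for Druid's SQL endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(query, base_url="http://localhost:8888"):
    """Send the query to a running Druid and return the parsed JSON rows."""
    with urllib.request.urlopen(build_sql_request(query, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With Druid running, run_sql('SELECT * FROM "sales_data_orc"') should return the ingested rows as a list of JSON objects.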
That’s it, you are done. Happy learning!