Programming for beginners: Hive: Dynamic partitioning

Partitioning is a process of keeping related data together based on a partition key.

Let me explain it with an example of order table in an e-commerce company.

orderId	productId	discountCode	countryCode	date
1	p1	d1	IN	2022-04-21
2	p2	d1	IN	2022-04-21
3	p3	d2	USA	2022-04-18

Assume 90% your queries are based on countryCode like

SELECT distinct(discountCode) FROM orders WHERE countryCode = 'IN';

Without partitioning, Hive needs to scan entire data of orders table and find out the discount codes.

What if I group the data by countryCode wise?

If you somehow able to keep the data together by countryCode wise, Hive can scan the data block that is related to the given countryCode. We can achieve this grouping by partitioning the data on column countryCode.

When you partition the data by countryCode column, data is grouped by countryCode.

There are two types of partitioning.

a. Static partitioning: You need to explicitly specify the partition column value on every load of the data.

b. Dynamic partitioning: Partitions are created dynamically.

Dynamic partitioning example

Dynamic partitioning is a four step process.

a. Create a staging table

b. Load the data to staging table

c. Create a partition table

d. Copy the data from staging table to partition table

To perform dynamic partition we need to set below properties in Hive interactive shell.

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

Create a staging table

CREATE TABLE orders_staging (
    orderId INT,
    productId STRING,
    discountCode STRING,
    timestamp BIGINT,
    countryCode STRING
 )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

hive> CREATE TABLE orders_staging (
    >     orderId INT,
    >     productId STRING,
    >     discountCode STRING,
    >     timestamp BIGINT,
    >     countryCode STRING
    >  )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.038 seconds
hive> ;
hive> DESCRIBE orders_staging;
OK
orderid             	int                 	                    
productid           	string              	                    
discountcode        	string              	                    
timestamp           	bigint              	                    
countrycode         	string              	                    
Time taken: 0.037 seconds, Fetched: 5 row(s)

Load data to orders_staging table

data.csv

1,p1,d1,1650533538101,IN
2,p1,d1,1650533538101,IN
3,p3,d3,1650533538101,IN
4,p9,d3,1650533552595,IN
5,p4,d2,1650533552595,IN
11,p11,d11,1650534429531,USA
12,p11,d11,1650534429531,USA
13,p13,d13,1650534437151,USA
14,p19,d13,1650534437151,USA
15,p14,d12,1650534437151,USA
21,p21,d21,1650534429531,JP
22,p21,d21,1650534429531,JP
23,p23,d23,1650534429531,JP
24,p29,d23,1650534437151,JP
25,p24,d22,1650534437151,JP

Execute below command to load the data from data.csv file to orders_satging table.

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/data.csv' 
INTO TABLE orders_staging

hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/data.csv' 
    > INTO TABLE orders_staging;
Loading data to table default.orders_staging
Table default.orders_staging stats: [numFiles=1, totalSize=410]
OK
Time taken: 0.146 seconds
hive> ;
hive> ;
hive> SELECT * FROM orders_staging;
OK
1	p1	d1	1650533538101	IN
2	p1	d1	1650533538101	IN
3	p3	d3	1650533538101	IN
4	p9	d3	1650533552595	IN
5	p4	d2	1650533552595	IN
11	p11	d11	1650534429531	USA
12	p11	d11	1650534429531	USA
13	p13	d13	1650534437151	USA
14	p19	d13	1650534437151	USA
15	p14	d12	1650534437151	USA
21	p21	d21	1650534429531	JP
22	p21	d21	1650534429531	JP
23	p23	d23	1650534429531	JP
24	p29	d23	1650534437151	JP
25	p24	d22	1650534437151	JP
Time taken: 0.041 seconds, Fetched: 15 row(s)

Create a partition table

CREATE TABLE orders (
    orderId INT,
    productId STRING,
    discountCode STRING,
    timestamp BIGINT
 )
PARTITIONED BY (countryCode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

hive> CREATE TABLE orders (
    >     orderId INT,
    >     productId STRING,
    >     discountCode STRING,
    >     timestamp BIGINT
    >  )
    > PARTITIONED BY (countryCode STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.068 seconds

Copy the data from staging table to partition table

INSERT INTO TABLE orders
PARTITION (countryCode)
SELECT * FROM orders_staging;

hive> INSERT INTO TABLE orders
    > PARTITION (countryCode)
    > SELECT * FROM orders_staging;
Query ID = cloudera_20220422033939_ca0795ee-acbf-4370-9323-6b6d73a0bf1f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0038, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0038/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0038
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-22 03:39:39,838 Stage-1 map = 0%,  reduce = 0%
2022-04-22 03:39:47,447 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.11 sec
MapReduce Total cumulative CPU time: 1 seconds 110 msec
Ended Job = job_1649172504056_0038
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders/.hive-staging_hive_2022-04-22_03-39-31_414_6216910787424850098-1/-ext-10000
Loading data to table default.orders partition (countrycode=null)
	 Time taken for load dynamic partitions : 453
	Loading partition {countrycode=USA}
	Loading partition {countrycode=IN}
	Loading partition {countrycode=JP}
	 Time taken for adding to write entity : 3
Partition default.orders{countrycode=IN} stats: [numFiles=1, numRows=5, totalSize=110, rawDataSize=105]
Partition default.orders{countrycode=JP} stats: [numFiles=1, numRows=5, totalSize=125, rawDataSize=120]
Partition default.orders{countrycode=USA} stats: [numFiles=1, numRows=5, totalSize=125, rawDataSize=120]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.11 sec   HDFS Read: 4865 HDFS Write: 545 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 110 msec
OK
Time taken: 17.826 seconds
hive>

Query the orders table.

hive> SELECT * FROM orders;
OK
1	p1	d1	1650533538101	IN
2	p1	d1	1650533538101	IN
3	p3	d3	1650533538101	IN
4	p9	d3	1650533552595	IN
5	p4	d2	1650533552595	IN
21	p21	d21	1650534429531	JP
22	p21	d21	1650534429531	JP
23	p23	d23	1650534429531	JP
24	p29	d23	1650534437151	JP
25	p24	d22	1650534437151	JP
11	p11	d11	1650534429531	USA
12	p11	d11	1650534429531	USA
13	p13	d13	1650534437151	USA
14	p19	d13	1650534437151	USA
15	p14	d12	1650534437151	USA
Time taken: 0.048 seconds, Fetched: 15 row(s)

Let’s check the orders table HDFS location and see how the data is organized

Execute the command ‘DESCRIBE FORMATTED orders;’ to confirm orders table hdfs location.

hive> DESCRIBE FORMATTED orders;
OK
# col_name            	data_type           	comment             
	 	 
orderid             	int                 	                    
productid           	string              	                    
discountcode        	string              	                    
timestamp           	bigint              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
countrycode         	string              	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	cloudera            	 
CreateTime:         	Fri Apr 22 03:38:12 PDT 2022	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	numPartitions       	3                   
	transient_lastDdlTime	1650623892          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	field.delim         	,                   
	serialization.format	,                   
Time taken: 0.046 seconds, Fetched: 36 row(s)
hive>

From the above output, we can confirm that the data stored in the location ‘/user/hive/warehouse/orders’.

Let’s query the folder ‘/user/hive/warehouse/orders’.

$hadoop fs -ls /user/hive/warehouse/orders
Found 3 items
drwxrwxrwx   - cloudera supergroup          0 2022-04-22 03:39 /user/hive/warehouse/orders/countrycode=IN
drwxrwxrwx   - cloudera supergroup          0 2022-04-22 03:39 /user/hive/warehouse/orders/countrycode=JP
drwxrwxrwx   - cloudera supergroup          0 2022-04-22 03:39 /user/hive/warehouse/orders/countrycode=USA

As you see above output, three folder are created.

a. countrycode=IN: Store the country code IN data

b. countrycode=JP: Store the country code JP data

c. countrycode=USA: Store the country code USA data

Let’s print the content and confirm the same.

$hadoop fs -cat /user/hive/warehouse/orders/countrycode=IN/*
1,p1,d1,1650533538101
2,p1,d1,1650533538101
3,p3,d3,1650533538101
4,p9,d3,1650533552595
5,p4,d2,1650533552595
$
$
$hadoop fs -cat /user/hive/warehouse/orders/countrycode=JP/*
21,p21,d21,1650534429531
22,p21,d21,1650534429531
23,p23,d23,1650534429531
24,p29,d23,1650534437151
25,p24,d22,1650534437151
$
$
$hadoop fs -cat /user/hive/warehouse/orders/countrycode=USA/*
11,p11,d11,1650534429531
12,p11,d11,1650534429531
13,p13,d13,1650534437151
14,p19,d13,1650534437151
15,p14,d12,1650534437151

Previous Next Home

Programming for beginners

Wednesday, 28 December 2022

Hive: Dynamic partitioning

No comments:

Post a Comment