Wednesday 28 December 2022

Hive: static partitioning

Partitioning is a process of keeping related data together based on a partition key.

 

Let me explain it with an example of order table in an e-commerce company.

 

orderId

productId

discountCode

countryCode

date

1

p1

d1

IN

2022-04-21

2

p2

d1

IN

2022-04-21

3

p3

d2

USA

2022-04-18

 

Assume 90% your queries are based on countryCode like


SELECT distinct(discountCode) FROM orders WHERE countryCode = 'IN';

 

Without partitioning, Hive needs to scan entire data of orders table and find out the discount codes.

What if I group the data by countryCode wise?

If you somehow able to keep the data together by countryCode wise, Hive can scan the data block that is related to the given countryCode. We can achieve this grouping by partitioning the data on column countryCode.

 


 

When you partition the data by countryCode column, data is grouped by countryCode.

 

There are two types of partitioning.

a.   Static partitioning: You need to explicitly specify the partition column value on every load of the data.

b.   Dynamic partitioning: Partitions are created dynamically.

 

Static partitioning example

Step 1: Created orders table.

CREATE TABLE orders (
    orderId INT,
    productId STRING,
    discountCode STRING,
    timestamp BIGINT
 )
PARTITIONED BY (countryCode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

hive> CREATE TABLE orders (
    >     orderId INT,
    >     productId STRING,
    >     discountCode STRING,
    >     timestamp BIGINT
    >  )
    > PARTITIONED BY (countryCode STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.04 seconds

STEP 2: Load data into orders table.

in.csv

1,p1,d1,1650533538101
2,p1,d1,1650533538101
3,p3,d3,1650533538101
4,p9,d3,1650533552595
5,p4,d2,1650533552595

Load the content of in.cv file to orders table by executing below command.

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/in.csv' 
INTO TABLE orders 
PARTITION (countryCode='IN');

As you observe above snippet, I explicitly mentioned countryCode value while loading the data.

hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/in.csv' 
    > INTO TABLE orders 
    > PARTITION (countryCode='IN');
Loading data to table default.orders partition (countrycode=IN)
Partition default.orders{countrycode=IN} stats: [numFiles=1, numRows=0, totalSize=110, rawDataSize=0]
OK
Time taken: 0.196 seconds
hive> ;
hive> ;
hive> SELECT * FROM orders;
OK
1	p1	d1	1650533538101	IN
2	p1	d1	1650533538101	IN
3	p3	d3	1650533538101	IN
4	p9	d3	1650533552595	IN
5	p4	d2	1650533552595	IN
Time taken: 0.042 seconds, Fetched: 5 row(s)

Hive creates a directory for each partition in HDFS and store. Let us confirm the same by querying orders table HDFS location. Execute the command ‘DESCRIBE FORMATTED orders;’ to find orders table HDFS location.

hive> DESCRIBE FORMATTED orders;
OK
# col_name            	data_type           	comment             
	 	 
orderid             	int                 	                    
productid           	string              	                    
discountcode        	string              	                    
timestamp           	bigint              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
countrycode         	string              	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	cloudera            	 
CreateTime:         	Thu Apr 21 02:34:27 PDT 2022	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	numPartitions       	1                   
	transient_lastDdlTime	1650533667          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	field.delim         	,                   
	serialization.format	,                   
Time taken: 0.043 seconds, Fetched: 36 row(s)
hive>

From the above output, I can confirm that the data is stored in the location ‘/user/hive/warehouse/orders’.

[root@quickstart examples]# hadoop fs -ls /user/hive/warehouse/orders
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2022-04-21 02:34 /user/hive/warehouse/orders/countrycode=IN
[root@quickstart examples]# 
[root@quickstart examples]# 
[root@quickstart examples]# hadoop fs -ls /user/hive/warehouse/orders/countrycode=IN
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        110 2022-04-21 02:34 /user/hive/warehouse/orders/countrycode=IN/in.csv
[root@quickstart examples]# 
[root@quickstart examples]# 
[root@quickstart examples]# hadoop fs -cat /user/hive/warehouse/orders/countrycode=IN/in.csv
1,p1,d1,1650533538101
2,p1,d1,1650533538101
3,p3,d3,1650533538101
4,p9,d3,1650533552595
5,p4,d2,1650533552595

As you see above snippet, Hive create a folder ‘countrycode=IN’ in orders folder to store the information related to countryCode IN.

Let’s load the data related to the country codes USA and JP

usa.csv

11,p11,d11,1650534429531
12,p11,d11,1650534429531
13,p13,d13,1650534437151
14,p19,d13,1650534437151
15,p14,d12,1650534437151

jp.csv

21,p21,d21,1650534429531
22,p21,d21,1650534429531
23,p23,d23,1650534429531
24,p29,d23,1650534437151
25,p24,d22,1650534437151

Load data by executing below commands.

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/usa.csv' 
INTO TABLE orders 
PARTITION (countryCode='USA');

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/jp.csv' 
INTO TABLE orders 
PARTITION (countryCode='JP');
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/usa.csv' 
    > INTO TABLE orders 
    > PARTITION (countryCode='USA');
Loading data to table default.orders partition (countrycode=USA)
Partition default.orders{countrycode=USA} stats: [numFiles=1, numRows=0, totalSize=125, rawDataSize=0]
OK
Time taken: 0.196 seconds
hive> ;
hive> ;
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/jp.csv' 
    > INTO TABLE orders 
    > PARTITION (countryCode='JP');
Loading data to table default.orders partition (countrycode=JP)
Partition default.orders{countrycode=JP} stats: [numFiles=1, numRows=0, totalSize=125, rawDataSize=0]
OK
Time taken: 0.209 seconds

Query the orders table and confirm the same.

hive> SELECT * FROM orders;
OK
1	p1	d1	1650533538101	IN
2	p1	d1	1650533538101	IN
3	p3	d3	1650533538101	IN
4	p9	d3	1650533552595	IN
5	p4	d2	1650533552595	IN
21	p21	d21	1650534429531	JP
22	p21	d21	1650534429531	JP
23	p23	d23	1650534429531	JP
24	p29	d23	1650534437151	JP
25	p24	d22	1650534437151	JP
11	p11	d11	1650534429531	USA
12	p11	d11	1650534429531	USA
13	p13	d13	1650534437151	USA
14	p19	d13	1650534437151	USA
15	p14	d12	1650534437151	USA
Time taken: 0.076 seconds, Fetched: 15 row(s)

Now when you query the hdfs folder /user/hive/warehouse/orders, you can observe that two more folders created, one for the countryCode USA and other for the countryCode JP.

a.   countrycode=JP

b.   countrycode=USA

 

[root@quickstart examples]# hadoop fs -ls /user/hive/warehouse/orders
Found 3 items
drwxrwxrwx   - cloudera supergroup          0 2022-04-21 02:34 /user/hive/warehouse/orders/countrycode=IN
drwxrwxrwx   - cloudera supergroup          0 2022-04-21 02:45 /user/hive/warehouse/orders/countrycode=JP
drwxrwxrwx   - cloudera supergroup          0 2022-04-21 02:45 /user/hive/warehouse/orders/countrycode=USA

What is the behaviour when I ran below query?

SELECT distinct(discountCode) FROM orders WHERE countryCode = 'IN';

 

Since the data is well partitioned, when you run above query, Hive process the data in folder ‘countrycode=IN’ and no need to process entire data.

 

How to get all the partitions of a table?

Execute the command ‘SHOW PARTITIONS {table_name};’

hive> SHOW PARTITIONS orders;
OK
countrycode=IN
countrycode=JP
countrycode=USA
Time taken: 0.052 seconds, Fetched: 3 row(s)



Previous                                                    Next                                                    Home

No comments:

Post a Comment