Friday, 30 December 2022

Hive: Perform partitioning on multiple columns

In this post, I am going to explain how to partition the dataset using two columns.

 

Suppose, I have below dataset to track the users that are using my website.

 

userId

totalTimeOnSite

timestamp

countryCode

browser

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

When I partition the data by countryCode and browser (chrome,firefox,edge,safari etc.,), Hive create below directory structure in HDFS.

 

.
|____USA
| |____firefox
| |____safari
| |____chrome
|____IN
| |____firefox
| |____safari
| |____chrome

 

As you see above snippet,

a.   Folders with country code USA, IN are created and

b.   firefox, safari, chrome are the folders created in USA and IN folders.

 

Let’s experiment it with an example.

 

Step 1: Set below properties in Hive shell.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;

 

Step 2: Create an user_visit_time_staging table.

 

CREATE TABLE user_visit_time_staging (
    userId INT,
    totalTimeOnSite DOUBLE,
    timestamp BIGINT,
    countryCode STRING,
    browser STRING
 )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

hive> CREATE TABLE user_visit_time_staging (
    >     userId INT,
    >     totalTimeOnSite DOUBLE,
    >     timestamp BIGINT,
    >     countryCode STRING,
    >     browser STRING
    >  )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.049 seconds
hive> ;
hive> ;
hive> DESC user_visit_time_staging;
OK
userid              	int                 	                    
totaltimeonsite     	double              	                    
timestamp           	bigint              	                    
countrycode         	string              	                    
browser             	string              	                    
Time taken: 0.042 seconds, Fetched: 5 row(s)

 

Step 3: Load visits.csv file to user_staging table.

 

visits.csv

1,2.34,1650533538101,IN,firefox
2,4.31,1650534429531,IN,chrome
3,1.2,1650534437151,USA,safari
4,10.2,1650534437151,JP,edge
5,4.56,1650534437345,USA,chrome
6,5.34,1650534437151,JP,firefox
7,10.2,1650534437151,JP,edge
8,4.56,1650534437345,USA,chrome
9,5.34,165053443715,JP,firefox
7,10.2,1650534437151,JP,edge
3,4.56,1650534437345,USA,chrome
4,5.34,1650534437151,JP,firefox

 

Load the data into user_visit_time_staging table by executing below command.

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/visits.csv' 
INTO TABLE user_visit_time_staging;

hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/visits.csv' 
    > INTO TABLE user_visit_time_staging;
Loading data to table default.user_visit_time_staging
Table default.user_visit_time_staging stats: [numFiles=1, totalSize=372]
OK
Time taken: 0.151 seconds
hive> ;
hive> ;
hive> SELECT * FROM user_visit_time_staging;
OK
1	2.34	1650533538101	IN	firefox
2	4.31	1650534429531	IN	chrome
3	1.2	1650534437151	USA	safari
4	10.2	1650534437151	JP	edge
5	4.56	1650534437345	USA	chrome
6	5.34	1650534437151	JP	firefox
7	10.2	1650534437151	JP	edge
8	4.56	1650534437345	USA	chrome
9	5.34	165053443715	JP	firefox
7	10.2	1650534437151	JP	edge
3	4.56	1650534437345	USA	chrome
4	5.34	1650534437151	JP	firefox
Time taken: 0.044 seconds, Fetched: 12 row(s)

 

Step 4: Create user_visit_time table and mention the partition columns at the time of table creation time.

CREATE TABLE user_visit_time (
    userId INT,
    totalTimeOnSite DOUBLE,
    timestamp BIGINT
 )
PARTITIONED BY (countryCode STRING, browser STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

hive> CREATE TABLE user_visit_time (
    >     userId INT,
    >     totalTimeOnSite DOUBLE,
    >     timestamp BIGINT
    >  )
    > PARTITIONED BY (countryCode STRING, browser STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.052 seconds

 

Step 5: Load the data from user_visit_time_staging table to user_visit_time table by executing below command.

INSERT INTO TABLE user_visit_time
PARTITION (countryCode, browser)
SELECT * FROM user_visit_time_staging;

hive> INSERT INTO TABLE user_visit_time
    > PARTITION (countryCode, browser)
    > SELECT * FROM user_visit_time_staging;
Query ID = root_20220424053131_66b41846-292d-4e09-bdd3-c592f436ef4a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0041, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0041/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0041
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-24 05:31:52,746 Stage-1 map = 0%,  reduce = 0%
2022-04-24 05:32:00,296 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.09 sec
MapReduce Total cumulative CPU time: 1 seconds 90 msec
Ended Job = job_1649172504056_0041
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/user_visit_time/.hive-staging_hive_2022-04-24_05-31-45_324_4936454984801614245-1/-ext-10000
Loading data to table default.user_visit_time partition (countrycode=null, browser=null)
	 Time taken for load dynamic partitions : 527
	Loading partition {countrycode=JP, browser=edge}
	Loading partition {countrycode=USA, browser=safari}
	Loading partition {countrycode=IN, browser=firefox}
	Loading partition {countrycode=JP, browser=firefox}
	Loading partition {countrycode=IN, browser=chrome}
	Loading partition {countrycode=USA, browser=chrome}
	 Time taken for adding to write entity : 1
Partition default.user_visit_time{countrycode=IN, browser=chrome} stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
Partition default.user_visit_time{countrycode=IN, browser=firefox} stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
Partition default.user_visit_time{countrycode=JP, browser=edge} stats: [numFiles=1, numRows=3, totalSize=63, rawDataSize=60]
Partition default.user_visit_time{countrycode=JP, browser=firefox} stats: [numFiles=1, numRows=3, totalSize=62, rawDataSize=59]
Partition default.user_visit_time{countrycode=USA, browser=chrome} stats: [numFiles=1, numRows=3, totalSize=63, rawDataSize=60]
Partition default.user_visit_time{countrycode=USA, browser=safari} stats: [numFiles=1, numRows=1, totalSize=20, rawDataSize=19]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.09 sec   HDFS Read: 5042 HDFS Write: 721 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 90 msec
OK
Time taken: 16.809 seconds
hive>

 

Let’s query ‘user_visit_time’ table.

hive> SELECT * FROM user_visit_time;
OK
2	4.31	1650534429531	IN	chrome
1	2.34	1650533538101	IN	firefox
4	10.2	1650534437151	JP	edge
7	10.2	1650534437151	JP	edge
7	10.2	1650534437151	JP	edge
6	5.34	1650534437151	JP	firefox
9	5.34	165053443715	JP	firefox
4	5.34	1650534437151	JP	firefox
5	4.56	1650534437345	USA	chrome
8	4.56	1650534437345	USA	chrome
3	4.56	1650534437345	USA	chrome
3	1.2	1650534437151	USA	safari
Time taken: 0.069 seconds, Fetched: 12 row(s)

 

Let’s query HDFS location of user_visit_time table and confirm the folder hierarchy.

 

How to find the HDFS location of user_visit_time table?

Execute the command ‘DESCRIBE FORMATTED user_visit_time’

hive> DESCRIBE FORMATTED user_visit_time;
OK
# col_name            	data_type           	comment             
	 	 
userid              	int                 	                    
totaltimeonsite     	double              	                    
timestamp           	bigint              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
countrycode         	string              	                    
browser             	string              	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	root                	 
CreateTime:         	Sun Apr 24 05:31:26 PDT 2022	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://quickstart.cloudera:8020/user/hive/warehouse/user_visit_time	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	numPartitions       	6                   
	transient_lastDdlTime	1650803486          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	field.delim         	,                   
	serialization.format	,                   
Time taken: 0.086 seconds, Fetched: 36 row(s)
hive>

 

From the command output, you can confirm that Hive store the content at location ‘/user/hive/warehouse/user_visit_time’.

 

Let’s query the HDFS folder structure.

$hadoop fs -ls /user/hive/warehouse/user_visit_time
Found 3 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:32 /user/hive/warehouse/user_visit_time/countrycode=IN
drwxrwxrwx   - root supergroup          0 2022-04-24 05:32 /user/hive/warehouse/user_visit_time/countrycode=JP
drwxrwxrwx   - root supergroup          0 2022-04-24 05:32 /user/hive/warehouse/user_visit_time/countrycode=USA
$
$
$hadoop fs -ls /user/hive/warehouse/user_visit_time/countrycode=IN
Found 2 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=IN/browser=chrome
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=IN/browser=firefox
$
$hadoop fs -ls /user/hive/warehouse/user_visit_time/countrycode=JP
Found 2 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=JP/browser=edge
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=JP/browser=firefox
$
$
$hadoop fs -ls /user/hive/warehouse/user_visit_time/countrycode=USA
Found 2 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=USA/browser=chrome
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=USA/browser=safari

 

Let’s find the user visits who are staying in USA and on browser chrome.

$hadoop fs -cat /user/hive/warehouse/user_visit_time/countrycode=USA/browser=chrome/*
5,4.56,1650534437345
8,4.56,1650534437345
3,4.56,1650534437345

 

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment