In this post, I am going to explain how to partition a dataset using two columns.
Suppose I have the below dataset that tracks the users visiting my website.
userId | totalTimeOnSite | timestamp | countryCode | browser
1 | 2.34 | 1650533538101 | IN | firefox
2 | 4.31 | 1650534429531 | IN | chrome
3 | 1.2 | 1650534437151 | USA | safari
4 | 10.2 | 1650534437151 | JP | edge
5 | 4.56 | 1650534437345 | USA | chrome
6 | 5.34 | 1650534437151 | JP | firefox
7 | 10.2 | 1650534437151 | JP | edge
8 | 4.56 | 1650534437345 | USA | chrome
9 | 5.34 | 165053443715 | JP | firefox
7 | 10.2 | 1650534437151 | JP | edge
3 | 4.56 | 1650534437345 | USA | chrome
4 | 5.34 | 1650534437151 | JP | firefox
When I partition the data by countryCode and browser (chrome, firefox, edge, safari, etc.), Hive creates the below directory structure in HDFS.
.
|____USA
| |____firefox
| |____safari
| |____chrome
|____IN
| |____firefox
| |____safari
| |____chrome
As you can see in the above snippet,
a. folders are created for the country codes USA and IN, and
b. firefox, safari and chrome folders are created inside the USA and IN folders.
Let’s experiment with an example.
Step 1: Set the below properties in the Hive shell to enable dynamic partitioning. Setting hive.exec.dynamic.partition.mode to nonstrict allows all partition columns to be determined dynamically; in strict mode, at least one partition column must be given a static value.
SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
Step 2: Create a user_visit_time_staging table.
CREATE TABLE user_visit_time_staging (
userId INT,
totalTimeOnSite DOUBLE,
timestamp BIGINT,
countryCode STRING,
browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> CREATE TABLE user_visit_time_staging (
> userId INT,
> totalTimeOnSite DOUBLE,
> timestamp BIGINT,
> countryCode STRING,
> browser STRING
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.049 seconds
hive> DESC user_visit_time_staging;
OK
userid int
totaltimeonsite double
timestamp bigint
countrycode string
browser string
Time taken: 0.042 seconds, Fetched: 5 row(s)
Step 3: Load the visits.csv file into the user_visit_time_staging table.
visits.csv
1,2.34,1650533538101,IN,firefox
2,4.31,1650534429531,IN,chrome
3,1.2,1650534437151,USA,safari
4,10.2,1650534437151,JP,edge
5,4.56,1650534437345,USA,chrome
6,5.34,1650534437151,JP,firefox
7,10.2,1650534437151,JP,edge
8,4.56,1650534437345,USA,chrome
9,5.34,165053443715,JP,firefox
7,10.2,1650534437151,JP,edge
3,4.56,1650534437345,USA,chrome
4,5.34,1650534437151,JP,firefox
Load the data into the user_visit_time_staging table by executing the below command.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/visits.csv'
INTO TABLE user_visit_time_staging;
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/visits.csv'
> INTO TABLE user_visit_time_staging;
Loading data to table default.user_visit_time_staging
Table default.user_visit_time_staging stats: [numFiles=1, totalSize=372]
OK
Time taken: 0.151 seconds
hive> SELECT * FROM user_visit_time_staging;
OK
1 2.34 1650533538101 IN firefox
2 4.31 1650534429531 IN chrome
3 1.2 1650534437151 USA safari
4 10.2 1650534437151 JP edge
5 4.56 1650534437345 USA chrome
6 5.34 1650534437151 JP firefox
7 10.2 1650534437151 JP edge
8 4.56 1650534437345 USA chrome
9 5.34 165053443715 JP firefox
7 10.2 1650534437151 JP edge
3 4.56 1650534437345 USA chrome
4 5.34 1650534437151 JP firefox
Time taken: 0.044 seconds, Fetched: 12 row(s)
Step 4: Create the user_visit_time table, specifying the partition columns at table creation time. Note that the partition columns countryCode and browser appear only in the PARTITIONED BY clause, not in the main column list.
CREATE TABLE user_visit_time (
userId INT,
totalTimeOnSite DOUBLE,
timestamp BIGINT
)
PARTITIONED BY (countryCode STRING, browser STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> CREATE TABLE user_visit_time (
> userId INT,
> totalTimeOnSite DOUBLE,
> timestamp BIGINT
> )
> PARTITIONED BY (countryCode STRING, browser STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.052 seconds
Step 5: Load the data from the user_visit_time_staging table into the user_visit_time table by executing the below command.
INSERT INTO TABLE user_visit_time
PARTITION (countryCode, browser)
SELECT * FROM user_visit_time_staging;
hive> INSERT INTO TABLE user_visit_time
> PARTITION (countryCode, browser)
> SELECT * FROM user_visit_time_staging;
Query ID = root_20220424053131_66b41846-292d-4e09-bdd3-c592f436ef4a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0041, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0041/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1649172504056_0041
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-24 05:31:52,746 Stage-1 map = 0%, reduce = 0%
2022-04-24 05:32:00,296 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.09 sec
MapReduce Total cumulative CPU time: 1 seconds 90 msec
Ended Job = job_1649172504056_0041
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/user_visit_time/.hive-staging_hive_2022-04-24_05-31-45_324_4936454984801614245-1/-ext-10000
Loading data to table default.user_visit_time partition (countrycode=null, browser=null)
Time taken for load dynamic partitions : 527
Loading partition {countrycode=JP, browser=edge}
Loading partition {countrycode=USA, browser=safari}
Loading partition {countrycode=IN, browser=firefox}
Loading partition {countrycode=JP, browser=firefox}
Loading partition {countrycode=IN, browser=chrome}
Loading partition {countrycode=USA, browser=chrome}
Time taken for adding to write entity : 1
Partition default.user_visit_time{countrycode=IN, browser=chrome} stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
Partition default.user_visit_time{countrycode=IN, browser=firefox} stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
Partition default.user_visit_time{countrycode=JP, browser=edge} stats: [numFiles=1, numRows=3, totalSize=63, rawDataSize=60]
Partition default.user_visit_time{countrycode=JP, browser=firefox} stats: [numFiles=1, numRows=3, totalSize=62, rawDataSize=59]
Partition default.user_visit_time{countrycode=USA, browser=chrome} stats: [numFiles=1, numRows=3, totalSize=63, rawDataSize=60]
Partition default.user_visit_time{countrycode=USA, browser=safari} stats: [numFiles=1, numRows=1, totalSize=20, rawDataSize=19]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.09 sec HDFS Read: 5042 HDFS Write: 721 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 90 msec
OK
Time taken: 16.809 seconds
hive>
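Dynamic partitioning maps the trailing columns of the SELECT to the partition columns positionally, so SELECT * works here only because countryCode and browser happen to be the last two columns of the staging table. A sketch with an explicit column list (same tables as above) makes that mapping visible; the backticks around timestamp are a precaution, since it is a reserved type keyword in some Hive versions:

```sql
-- Partition columns must come last in the SELECT list,
-- in the same order as declared in PARTITIONED BY.
INSERT INTO TABLE user_visit_time
PARTITION (countryCode, browser)
SELECT userId, totalTimeOnSite, `timestamp`, countryCode, browser
FROM user_visit_time_staging;
```

If the SELECT columns were in a different order, Hive would still write them positionally, silently creating wrong partitions, so it is safer to spell the columns out.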
Let’s query the ‘user_visit_time’ table.
hive> SELECT * FROM user_visit_time;
OK
2 4.31 1650534429531 IN chrome
1 2.34 1650533538101 IN firefox
4 10.2 1650534437151 JP edge
7 10.2 1650534437151 JP edge
7 10.2 1650534437151 JP edge
6 5.34 1650534437151 JP firefox
9 5.34 165053443715 JP firefox
4 5.34 1650534437151 JP firefox
5 4.56 1650534437345 USA chrome
8 4.56 1650534437345 USA chrome
3 4.56 1650534437345 USA chrome
3 1.2 1650534437151 USA safari
Time taken: 0.069 seconds, Fetched: 12 row(s)
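Notice that the rows come back grouped by partition, and the partition columns appear as the last two columns of each row even though they are stored in directory names rather than in the data files. You can also ask Hive to list the partitions it created:

```sql
-- Lists every partition of the table, one per line,
-- in the form countrycode=IN/browser=chrome
SHOW PARTITIONS user_visit_time;
```

For our data this should report the six country/browser combinations seen in the INSERT log above.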
Let’s check the HDFS location of the user_visit_time table and confirm the folder hierarchy.
How to find the HDFS location of user_visit_time table?
Execute the command ‘DESCRIBE FORMATTED user_visit_time’
hive> DESCRIBE FORMATTED user_visit_time;
OK
# col_name data_type comment
userid int
totaltimeonsite double
timestamp bigint
# Partition Information
# col_name data_type comment
countrycode string
browser string
# Detailed Table Information
Database: default
Owner: root
CreateTime: Sun Apr 24 05:31:26 PDT 2022
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/user_visit_time
Table Type: MANAGED_TABLE
Table Parameters:
numPartitions 6
transient_lastDdlTime 1650803486
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.086 seconds, Fetched: 36 row(s)
hive>
From the command output, you can confirm that Hive stores the content at the location ‘/user/hive/warehouse/user_visit_time’.
Let’s query the HDFS folder structure.
$ hadoop fs -ls /user/hive/warehouse/user_visit_time
Found 3 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:32 /user/hive/warehouse/user_visit_time/countrycode=IN
drwxrwxrwx   - root supergroup          0 2022-04-24 05:32 /user/hive/warehouse/user_visit_time/countrycode=JP
drwxrwxrwx   - root supergroup          0 2022-04-24 05:32 /user/hive/warehouse/user_visit_time/countrycode=USA

$ hadoop fs -ls /user/hive/warehouse/user_visit_time/countrycode=IN
Found 2 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=IN/browser=chrome
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=IN/browser=firefox

$ hadoop fs -ls /user/hive/warehouse/user_visit_time/countrycode=JP
Found 2 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=JP/browser=edge
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=JP/browser=firefox

$ hadoop fs -ls /user/hive/warehouse/user_visit_time/countrycode=USA
Found 2 items
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=USA/browser=chrome
drwxrwxrwx   - root supergroup          0 2022-04-24 05:31 /user/hive/warehouse/user_visit_time/countrycode=USA/browser=safari
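Instead of listing each level separately, the whole tree can be printed in one command (assuming a Hadoop release whose `fs -ls` supports the -R flag, as 2.x and later do):

```shell
# Recursively list every partition directory and data file under the table
hadoop fs -ls -R /user/hive/warehouse/user_visit_time
```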
Let’s find the visits of users who are in the USA and on the chrome browser.
$hadoop fs -cat /user/hive/warehouse/user_visit_time/countrycode=USA/browser=chrome/*
5,4.56,1650534437345
8,4.56,1650534437345
3,4.56,1650534437345
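Note that the data files contain only the non-partition columns; the country and browser live in the directory names. The same lookup can be done in HiveQL by filtering on the partition columns, in which case Hive reads only the countrycode=USA/browser=chrome directory instead of scanning the whole table (a sketch against the tables created above):

```sql
-- Partition pruning: only the matching partition directory is read
SELECT userId, totalTimeOnSite, `timestamp`
FROM user_visit_time
WHERE countryCode = 'USA' AND browser = 'chrome';
```

This partition pruning is the main payoff of the two-column layout: queries that filter on countryCode and browser touch only a small fraction of the data.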