Partitioning is a process of keeping related data together based on a partition key.
Let me explain it with an example of order table in an e-commerce company.
orderId |
productId |
discountCode |
countryCode |
date |
1 |
p1 |
d1 |
IN |
2022-04-21 |
2 |
p2 |
d1 |
IN |
2022-04-21 |
3 |
p3 |
d2 |
USA |
2022-04-18 |
Assume 90% your queries are based on countryCode like
SELECT distinct(discountCode) FROM orders WHERE countryCode = 'IN';
Without partitioning, Hive needs to scan entire data of orders table and find out the discount codes.
What if I group the data by countryCode wise?
If you somehow able to keep the data together by countryCode wise, Hive can scan the data block that is related to the given countryCode. We can achieve this grouping by partitioning the data on column countryCode.
When you partition the data by countryCode column, data is grouped by countryCode.
There are two types of partitioning.
a. Static partitioning: You need to explicitly specify the partition column value on every load of the data.
b. Dynamic partitioning: Partitions are created dynamically.
Dynamic partitioning example
Dynamic partitioning is a four step process.
a. Create a staging table
b. Load the data to staging table
c. Create a partition table
d. Copy the data from staging table to partition table
To perform dynamic partition we need to set below properties in Hive interactive shell.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
Create a staging table
CREATE TABLE orders_staging (
orderId INT,
productId STRING,
discountCode STRING,
timestamp BIGINT,
countryCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> CREATE TABLE orders_staging (
> orderId INT,
> productId STRING,
> discountCode STRING,
> timestamp BIGINT,
> countryCode STRING
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.038 seconds
hive> ;
hive> DESCRIBE orders_staging;
OK
orderid int
productid string
discountcode string
timestamp bigint
countrycode string
Time taken: 0.037 seconds, Fetched: 5 row(s)
Load data to orders_staging table
data.csv
1,p1,d1,1650533538101,IN 2,p1,d1,1650533538101,IN 3,p3,d3,1650533538101,IN 4,p9,d3,1650533552595,IN 5,p4,d2,1650533552595,IN 11,p11,d11,1650534429531,USA 12,p11,d11,1650534429531,USA 13,p13,d13,1650534437151,USA 14,p19,d13,1650534437151,USA 15,p14,d12,1650534437151,USA 21,p21,d21,1650534429531,JP 22,p21,d21,1650534429531,JP 23,p23,d23,1650534429531,JP 24,p29,d23,1650534437151,JP 25,p24,d22,1650534437151,JP
Execute below command to load the data from data.csv file to orders_satging table.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/data.csv'
INTO TABLE orders_staging
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/data.csv'
> INTO TABLE orders_staging;
Loading data to table default.orders_staging
Table default.orders_staging stats: [numFiles=1, totalSize=410]
OK
Time taken: 0.146 seconds
hive> ;
hive> ;
hive> SELECT * FROM orders_staging;
OK
1 p1 d1 1650533538101 IN
2 p1 d1 1650533538101 IN
3 p3 d3 1650533538101 IN
4 p9 d3 1650533552595 IN
5 p4 d2 1650533552595 IN
11 p11 d11 1650534429531 USA
12 p11 d11 1650534429531 USA
13 p13 d13 1650534437151 USA
14 p19 d13 1650534437151 USA
15 p14 d12 1650534437151 USA
21 p21 d21 1650534429531 JP
22 p21 d21 1650534429531 JP
23 p23 d23 1650534429531 JP
24 p29 d23 1650534437151 JP
25 p24 d22 1650534437151 JP
Time taken: 0.041 seconds, Fetched: 15 row(s)
Create a partition table
CREATE TABLE orders (
orderId INT,
productId STRING,
discountCode STRING,
timestamp BIGINT
)
PARTITIONED BY (countryCode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> CREATE TABLE orders (
> orderId INT,
> productId STRING,
> discountCode STRING,
> timestamp BIGINT
> )
> PARTITIONED BY (countryCode STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.068 seconds
Copy the data from staging table to partition table
INSERT INTO TABLE orders
PARTITION (countryCode)
SELECT * FROM orders_staging;
hive> INSERT INTO TABLE orders
> PARTITION (countryCode)
> SELECT * FROM orders_staging;
Query ID = cloudera_20220422033939_ca0795ee-acbf-4370-9323-6b6d73a0bf1f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0038, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0038/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1649172504056_0038
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-22 03:39:39,838 Stage-1 map = 0%, reduce = 0%
2022-04-22 03:39:47,447 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.11 sec
MapReduce Total cumulative CPU time: 1 seconds 110 msec
Ended Job = job_1649172504056_0038
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders/.hive-staging_hive_2022-04-22_03-39-31_414_6216910787424850098-1/-ext-10000
Loading data to table default.orders partition (countrycode=null)
Time taken for load dynamic partitions : 453
Loading partition {countrycode=USA}
Loading partition {countrycode=IN}
Loading partition {countrycode=JP}
Time taken for adding to write entity : 3
Partition default.orders{countrycode=IN} stats: [numFiles=1, numRows=5, totalSize=110, rawDataSize=105]
Partition default.orders{countrycode=JP} stats: [numFiles=1, numRows=5, totalSize=125, rawDataSize=120]
Partition default.orders{countrycode=USA} stats: [numFiles=1, numRows=5, totalSize=125, rawDataSize=120]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.11 sec HDFS Read: 4865 HDFS Write: 545 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 110 msec
OK
Time taken: 17.826 seconds
hive>
Query the orders table.
hive> SELECT * FROM orders;
OK
1 p1 d1 1650533538101 IN
2 p1 d1 1650533538101 IN
3 p3 d3 1650533538101 IN
4 p9 d3 1650533552595 IN
5 p4 d2 1650533552595 IN
21 p21 d21 1650534429531 JP
22 p21 d21 1650534429531 JP
23 p23 d23 1650534429531 JP
24 p29 d23 1650534437151 JP
25 p24 d22 1650534437151 JP
11 p11 d11 1650534429531 USA
12 p11 d11 1650534429531 USA
13 p13 d13 1650534437151 USA
14 p19 d13 1650534437151 USA
15 p14 d12 1650534437151 USA
Time taken: 0.048 seconds, Fetched: 15 row(s)
Let’s check the orders table HDFS location and see how the data is organized
Execute the command ‘DESCRIBE FORMATTED orders;’ to confirm orders table hdfs location.
hive> DESCRIBE FORMATTED orders;
OK
# col_name data_type comment
orderid int
productid string
discountcode string
timestamp bigint
# Partition Information
# col_name data_type comment
countrycode string
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Fri Apr 22 03:38:12 PDT 2022
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders
Table Type: MANAGED_TABLE
Table Parameters:
numPartitions 3
transient_lastDdlTime 1650623892
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.046 seconds, Fetched: 36 row(s)
hive>
From the above output, we can confirm that the data stored in the location ‘/user/hive/warehouse/orders’.
Let’s query the folder ‘/user/hive/warehouse/orders’.
$hadoop fs -ls /user/hive/warehouse/orders Found 3 items drwxrwxrwx - cloudera supergroup 0 2022-04-22 03:39 /user/hive/warehouse/orders/countrycode=IN drwxrwxrwx - cloudera supergroup 0 2022-04-22 03:39 /user/hive/warehouse/orders/countrycode=JP drwxrwxrwx - cloudera supergroup 0 2022-04-22 03:39 /user/hive/warehouse/orders/countrycode=USA
As you see above output, three folder are created.
a. countrycode=IN: Store the country code IN data
b. countrycode=JP: Store the country code JP data
c. countrycode=USA: Store the country code USA data
Let’s print the content and confirm the same.
$hadoop fs -cat /user/hive/warehouse/orders/countrycode=IN/*
1,p1,d1,1650533538101
2,p1,d1,1650533538101
3,p3,d3,1650533538101
4,p9,d3,1650533552595
5,p4,d2,1650533552595
$
$
$hadoop fs -cat /user/hive/warehouse/orders/countrycode=JP/*
21,p21,d21,1650534429531
22,p21,d21,1650534429531
23,p23,d23,1650534429531
24,p29,d23,1650534437151
25,p24,d22,1650534437151
$
$
$hadoop fs -cat /user/hive/warehouse/orders/countrycode=USA/*
11,p11,d11,1650534429531
12,p11,d11,1650534429531
13,p13,d13,1650534437151
14,p19,d13,1650534437151
15,p14,d12,1650534437151
No comments:
Post a Comment