Partitioning is a process of keeping related data together based on a partition key.
Let me explain it with an example of order table in an e-commerce company.
orderId |
productId |
discountCode |
countryCode |
date |
1 |
p1 |
d1 |
IN |
2022-04-21 |
2 |
p2 |
d1 |
IN |
2022-04-21 |
3 |
p3 |
d2 |
USA |
2022-04-18 |
Assume 90% your queries are based on countryCode like
SELECT distinct(discountCode) FROM orders WHERE countryCode = 'IN';
Without partitioning, Hive needs to scan entire data of orders table and find out the discount codes.
What if I group the data by countryCode wise?
If you somehow able to keep the data together by countryCode wise, Hive can scan the data block that is related to the given countryCode. We can achieve this grouping by partitioning the data on column countryCode.
When you partition the data by countryCode column, data is grouped by countryCode.
There are two types of partitioning.
a. Static partitioning: You need to explicitly specify the partition column value on every load of the data.
b. Dynamic partitioning: Partitions are created dynamically.
Static partitioning example
Step 1: Created orders table.
CREATE TABLE orders (
orderId INT,
productId STRING,
discountCode STRING,
timestamp BIGINT
)
PARTITIONED BY (countryCode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> CREATE TABLE orders (
> orderId INT,
> productId STRING,
> discountCode STRING,
> timestamp BIGINT
> )
> PARTITIONED BY (countryCode STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.04 seconds
STEP 2: Load data into orders table.
in.csv
1,p1,d1,1650533538101 2,p1,d1,1650533538101 3,p3,d3,1650533538101 4,p9,d3,1650533552595 5,p4,d2,1650533552595
Load the content of in.cv file to orders table by executing below command.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/in.csv' INTO TABLE orders PARTITION (countryCode='IN');
As you observe above snippet, I explicitly mentioned countryCode value while loading the data.
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/in.csv'
> INTO TABLE orders
> PARTITION (countryCode='IN');
Loading data to table default.orders partition (countrycode=IN)
Partition default.orders{countrycode=IN} stats: [numFiles=1, numRows=0, totalSize=110, rawDataSize=0]
OK
Time taken: 0.196 seconds
hive> ;
hive> ;
hive> SELECT * FROM orders;
OK
1 p1 d1 1650533538101 IN
2 p1 d1 1650533538101 IN
3 p3 d3 1650533538101 IN
4 p9 d3 1650533552595 IN
5 p4 d2 1650533552595 IN
Time taken: 0.042 seconds, Fetched: 5 row(s)
Hive creates a directory for each partition in HDFS and store. Let us confirm the same by querying orders table HDFS location. Execute the command ‘DESCRIBE FORMATTED orders;’ to find orders table HDFS location.
hive> DESCRIBE FORMATTED orders;
OK
# col_name data_type comment
orderid int
productid string
discountcode string
timestamp bigint
# Partition Information
# col_name data_type comment
countrycode string
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Thu Apr 21 02:34:27 PDT 2022
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders
Table Type: MANAGED_TABLE
Table Parameters:
numPartitions 1
transient_lastDdlTime 1650533667
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.043 seconds, Fetched: 36 row(s)
hive>
From the above output, I can confirm that the data is stored in the location ‘/user/hive/warehouse/orders’.
[root@quickstart examples]# hadoop fs -ls /user/hive/warehouse/orders Found 1 items drwxrwxrwx - cloudera supergroup 0 2022-04-21 02:34 /user/hive/warehouse/orders/countrycode=IN [root@quickstart examples]# [root@quickstart examples]# [root@quickstart examples]# hadoop fs -ls /user/hive/warehouse/orders/countrycode=IN Found 1 items -rwxrwxrwx 1 cloudera supergroup 110 2022-04-21 02:34 /user/hive/warehouse/orders/countrycode=IN/in.csv [root@quickstart examples]# [root@quickstart examples]# [root@quickstart examples]# hadoop fs -cat /user/hive/warehouse/orders/countrycode=IN/in.csv 1,p1,d1,1650533538101 2,p1,d1,1650533538101 3,p3,d3,1650533538101 4,p9,d3,1650533552595 5,p4,d2,1650533552595
As you see above snippet, Hive create a folder ‘countrycode=IN’ in orders folder to store the information related to countryCode IN.
Let’s load the data related to the country codes USA and JP
usa.csv
11,p11,d11,1650534429531 12,p11,d11,1650534429531 13,p13,d13,1650534437151 14,p19,d13,1650534437151 15,p14,d12,1650534437151
jp.csv
21,p21,d21,1650534429531 22,p21,d21,1650534429531 23,p23,d23,1650534429531 24,p29,d23,1650534437151 25,p24,d22,1650534437151
Load data by executing below commands.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/usa.csv'
INTO TABLE orders
PARTITION (countryCode='USA');
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/jp.csv'
INTO TABLE orders
PARTITION (countryCode='JP');
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/usa.csv'
> INTO TABLE orders
> PARTITION (countryCode='USA');
Loading data to table default.orders partition (countrycode=USA)
Partition default.orders{countrycode=USA} stats: [numFiles=1, numRows=0, totalSize=125, rawDataSize=0]
OK
Time taken: 0.196 seconds
hive> ;
hive> ;
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/examples/jp.csv'
> INTO TABLE orders
> PARTITION (countryCode='JP');
Loading data to table default.orders partition (countrycode=JP)
Partition default.orders{countrycode=JP} stats: [numFiles=1, numRows=0, totalSize=125, rawDataSize=0]
OK
Time taken: 0.209 seconds
Query the orders table and confirm the same.
hive> SELECT * FROM orders;
OK
1 p1 d1 1650533538101 IN
2 p1 d1 1650533538101 IN
3 p3 d3 1650533538101 IN
4 p9 d3 1650533552595 IN
5 p4 d2 1650533552595 IN
21 p21 d21 1650534429531 JP
22 p21 d21 1650534429531 JP
23 p23 d23 1650534429531 JP
24 p29 d23 1650534437151 JP
25 p24 d22 1650534437151 JP
11 p11 d11 1650534429531 USA
12 p11 d11 1650534429531 USA
13 p13 d13 1650534437151 USA
14 p19 d13 1650534437151 USA
15 p14 d12 1650534437151 USA
Time taken: 0.076 seconds, Fetched: 15 row(s)
Now when you query the hdfs folder /user/hive/warehouse/orders, you can observe that two more folders created, one for the countryCode USA and other for the countryCode JP.
a. countrycode=JP
b. countrycode=USA
[root@quickstart examples]# hadoop fs -ls /user/hive/warehouse/orders Found 3 items drwxrwxrwx - cloudera supergroup 0 2022-04-21 02:34 /user/hive/warehouse/orders/countrycode=IN drwxrwxrwx - cloudera supergroup 0 2022-04-21 02:45 /user/hive/warehouse/orders/countrycode=JP drwxrwxrwx - cloudera supergroup 0 2022-04-21 02:45 /user/hive/warehouse/orders/countrycode=USA
What is the behaviour when I ran below query?
SELECT distinct(discountCode) FROM orders WHERE countryCode = 'IN';
Since the data is well partitioned, when you run above query, Hive process the data in folder ‘countrycode=IN’ and no need to process entire data.
How to get all the partitions of a table?
Execute the command ‘SHOW PARTITIONS {table_name};’
hive> SHOW PARTITIONS orders;
OK
countrycode=IN
countrycode=JP
countrycode=USA
Time taken: 0.052 seconds, Fetched: 3 row(s)
No comments:
Post a Comment