Programming for beginners: Quick guide to partitioning in Hive

Partitioning is a process of keeping related data together based on a partition key.

Let me explain it with an example of order table in an e-commerce company.

orderId	productId	discountCode	countryCode	date
1	p1	d1	IN	2022-04-21
2	p2	d1	IN	2022-04-21
3	p3	d2	USA	2022-04-18

Assume 90% your queries are based on countryCode like

SELECT distinct(disscountCode) FROM orders WHERE countryCode = {COUNTRY_CODE}

Without partitioning, Hive needs to scan entire data of orders table and find out the discount codes.

What if I group the data by countryCode wise?

If you somehow able to keep the data together by countryCode wise, Hive can scan the data block that is related to the given countryCode. We can achieve this grouping by partitioning the data on column countryCode.

When you partition the data by countryCode column, data is grouped by countryCode.

There are two types of partitioning.

a. Static partitioning: You need to explicitly specify the partition column value on every load of the data.

b. Dynamic partitioning: Partitions are created dynamically.

I will cover the examples of static and dynamic partitions in my later posts.

Previous Next Home

Programming for beginners

Tuesday, 27 December 2022

Quick guide to partitioning in Hive

No comments:

Post a Comment