What is the difference between partitioning and bucketing a table in Hive?

How does Spark decide stages and tasks during execution of a Job?

September 17, 2021

How to list all the columns in a Hive table?

September 22, 2021

Published by Big Data In Real World at September 20, 2021

Partitioning

Let’s take an example of a table named sales storing records of sales on a retail website. You could create a partition column on the sale_date. When you do that Hive creates a partition of each sale_date.

/sales/sale_date=2020-10-20
Sale records for sale date 2020-10-20 stay here


/sales/sale_date=2020-10-21
Sale records for sale date 2020-10-21 stay here

The benefit is when you execute the below query filtering on the partition column, sale_date 2020-10-21, Hive will read and process the date only from /sales/sale_date=2020-10-21 folder and the result is very good performance.

SELECT * FROM sales 
where sale_date='2020-10-21'

Now let’s extend the above example. Let’s say in addition to filter on the sale_date, you also often filter on the country. So you decide to partition the table based on country. At most you could have 195 partitions in a country. And yes, you can partition the table on multiple partition columns.

/sales/sale_date=2020-10-20/country=USA
Sale  records from USA with sale date 2020-10-20 stay here

Below query will be targeted to the above folder and the result is performance.

SELECT * FROM sales 
where sale_date='2020-10-21' and country='USA'

Bucketing

Now let’s say you also filter the sales record by sku (stock-keeping unit aka. barcode) in addition to sale_date and country. Creating a partition on sku will result in many partitions which is not ideal as it might result in uneven and smaller partitions.

Hadoop is not efficient in processing small volumes of data. There is a better way. We can bucket the sales table and use sku as the bucketing column, the value of this column will be hashed by a user-defined number into buckets.

Records with the same sku will always be stored in the same bucket. A bucket can have records from many skus. While creating a table you can specify like

CLUSTERED BY (sku) INTO X BUCKETS;

where X is the number of buckets. Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with data. If two tables are bucketed by sku, Hive can create a logically correct sampling of data. Bucketing also helps in doing efficient map-side joins etc.

When to partition and when to bucket the data

You need to understand your data and the common access patterns of your data to know when to partition and bucket and which column to partition and bucket your table.

A common pattern is to partition the data at a higher level. Bucket the data inside the partition to group the records into a fixed number of subsets. This will yield you bigger partitions and fixed number of buckets or record groups inside partitions.

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

Comments are closed.

What is the difference between partitioning and bucketing a table in Hive?

How does Spark decide stages and tasks during execution of a Job?

How to list all the columns in a Hive table?

How does Spark decide stages and tasks during execution of a Job?

How to list all the columns in a Hive table?

Partitioning

Bucketing

When to partition and when to bucket the data

Big Data In Real World

Related posts

How to transpose or convert columns to rows in Hive?

How to fail a Hive script based on a condition?

How to delete duplicate data from the Hive table?