How does Spark decide stages and tasks during execution of a Job?
September 17, 2021How to list all the columns in a Hive table?
September 22, 2021Both partitioning and bucketing are techniques in Hive to organize the data efficiently so subsequent executions on the data works with optimal performance.
Partitioning
Let’s take an example of a table named sales storing records of sales on a retail website. You could create a partition column on the sale_date. When you do that Hive creates a partition of each sale_date.
/sales/sale_date=2020-10-20 Sale records for sale date 2020-10-20 stay here /sales/sale_date=2020-10-21 Sale records for sale date 2020-10-21 stay here
The benefit is when you execute the below query filtering on the partition column, sale_date 2020-10-21, Hive will read and process the date only from /sales/sale_date=2020-10-21 folder and the result is very good performance.
SELECT * FROM sales where sale_date='2020-10-21'
Now let’s extend the above example. Let’s say in addition to filter on the sale_date, you also often filter on the country. So you decide to partition the table based on country. At most you could have 195 partitions in a country. And yes, you can partition the table on multiple partition columns.
/sales/sale_date=2020-10-20/country=USA Sale records from USA with sale date 2020-10-20 stay here
Below query will be targeted to the above folder and the result is performance.
SELECT * FROM sales where sale_date='2020-10-21' and country='USA'
Bucketing
Now let’s say you also filter the sales record by sku (stock-keeping unit aka. barcode) in addition to sale_date and country. Creating a partition on sku will result in many partitions which is not ideal as it might result in uneven and smaller partitions.
Hadoop is not efficient in processing small volumes of data. There is a better way. We can bucket the sales table and use sku as the bucketing column, the value of this column will be hashed by a user-defined number into buckets.
Records with the same sku will always be stored in the same bucket. A bucket can have records from many skus. While creating a table you can specify like
CLUSTERED BY (sku) INTO X BUCKETS;
where X is the number of buckets. Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with data. If two tables are bucketed by sku, Hive can create a logically correct sampling of data. Bucketing also helps in doing efficient map-side joins etc.
When to partition and when to bucket the data
You need to understand your data and the common access patterns of your data to know when to partition and bucket and which column to partition and bucket your table.
A common pattern is to partition the data at a higher level. Bucket the data inside the partition to group the records into a fixed number of subsets. This will yield you bigger partitions and fixed number of buckets or record groups inside partitions.