What is the difference between order by, sort by, cluster by and distribute by in Hive?


Published by Big Data In Real World on September 10, 2021

ORDER BY <column-name>

Guarantees global ordering.

Only one reducer is used to perform this operation.

Scales poorly for large datasets: because a single reducer has to process every row, the job can be very slow or fail with Out of Memory exceptions.

If successful, you will get fully sorted output across the whole dataset.
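As a rough illustration of why ORDER BY gives a global order but does not scale, here is a toy Python model of a query such as SELECT * FROM stocks ORDER BY symbol. The stocks rows and the order_by helper are assumptions made up for this sketch; it does not run Hive, it only mimics the single-reducer behaviour described above.

```python
# Toy model of ORDER BY: all rows go to ONE reducer, which sorts everything.
# The rows and the order_by helper are illustrative, not part of Hive.
rows = [("MSFT", 310.0), ("AAPL", 150.0), ("AAPL", 152.0), ("GOOG", 2800.0)]


def order_by(rows, key_index=0):
    """Collect every row on a single reducer and sort there."""
    single_reducer = list(rows)  # the whole dataset lands on one reducer
    return sorted(single_reducer, key=lambda row: row[key_index])


ordered = order_by(rows)
print([symbol for symbol, _ in ordered])  # ['AAPL', 'AAPL', 'GOOG', 'MSFT']
```

The single list holding every row is the bottleneck: it grows with the dataset, which is the memory pressure the section warns about.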

SORT BY <column-name>

Sorts the rows within each reducer, so the job can use multiple reducers.

If the column is a stock symbol in a stocks dataset, multiple reducers can receive records for the AAPL symbol, so AAPL can show up in multiple output files. The contents of each reducer's output file are ordered, but there is no ordering across files.

Doesn't give you global ordering.

Good for large datasets that do not require a global ordering.
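The behaviour of a query such as SELECT * FROM stocks SORT BY symbol can be sketched with a small Python model. The rows, the sort_by helper and the round-robin split are assumptions for illustration only; Hive decides for itself how rows reach the reducers when there is no DISTRIBUTE BY clause.

```python
# Toy model of SORT BY: rows are spread over several reducers with no regard
# to the sort column, and each reducer sorts only its own share.
rows = [("MSFT", 310.0), ("AAPL", 150.0), ("AAPL", 152.0), ("GOOG", 2800.0)]


def sort_by(rows, num_reducers=2, key_index=0):
    """Split rows round-robin (an assumption), then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)
    return [sorted(part, key=lambda row: row[key_index]) for part in reducers]


outputs = sort_by(rows)
for reducer_id, part in enumerate(outputs):
    print(reducer_id, [symbol for symbol, _ in part])
# 0 ['AAPL', 'MSFT']
# 1 ['AAPL', 'GOOG']
```

Each output is sorted, but AAPL appears in both, and reading the files one after another does not give a sorted result.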

CLUSTER BY <column-name>

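If the column is a stock symbol in a stocks dataset, all records for the AAPL symbol go to the same reducer, so you will find AAPL in the output of only one reducer. Rows are sorted within each reducer's output, and one reducer can receive more than one symbol.

Doesn't give you global ordering: the symbols handled by different reducers can interleave, so the output files taken together are not necessarily sorted.

Good for large datasets where all rows for a key must end up together and sorted, but a global ordering is not required.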
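A toy Python model of a query such as SELECT * FROM stocks CLUSTER BY symbol makes the per-key guarantee easier to see. The rows, the cluster_by helper and the toy_hash function are assumptions for this sketch; Hive uses its own hash function to pick a reducer.

```python
# Toy model of CLUSTER BY: rows are routed to a reducer by a hash of the
# column, then each reducer sorts its rows by that same column.
rows = [("MSFT", 310.0), ("AAPL", 150.0), ("AAPL", 152.0), ("GOOG", 2800.0)]


def toy_hash(symbol):
    """Deterministic stand-in for Hive's hash function (an assumption)."""
    return sum(ord(ch) for ch in symbol)


def cluster_by(rows, num_reducers=3, key_index=0):
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[toy_hash(row[key_index]) % num_reducers].append(row)
    return [sorted(part, key=lambda row: row[key_index]) for part in reducers]


outputs = cluster_by(rows)
for reducer_id, part in enumerate(outputs):
    print(reducer_id, [symbol for symbol, _ in part])
# 0 ['GOOG']
# 1 ['AAPL', 'AAPL']
# 2 ['MSFT']
```

Every symbol lands on exactly one reducer, yet reading the outputs in reducer order gives GOOG, AAPL, AAPL, MSFT, which is not globally sorted.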

DISTRIBUTE BY <column-name> and SORT BY <column-name>

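Same as CLUSTER BY when both clauses use the same column: DISTRIBUTE BY sends all rows with the same column value to the same reducer, and SORT BY orders the rows within each reducer. Writing the two clauses separately also lets you distribute on one column and sort on another.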
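The relationship between the two forms can be sketched in Python as two separate steps. Everything here (the rows, the distribute_by and sort_by helpers, toy_hash and the price column) is an assumption for illustration; it mimics a query such as SELECT * FROM stocks DISTRIBUTE BY symbol SORT BY price DESC without running Hive.

```python
# Toy model: DISTRIBUTE BY picks the reducer, SORT BY orders rows inside it.
rows = [("MSFT", 310.0), ("AAPL", 150.0), ("AAPL", 152.0), ("GOOG", 2800.0)]
NUM_REDUCERS = 3


def toy_hash(symbol):
    """Deterministic stand-in for Hive's hash function (an assumption)."""
    return sum(ord(ch) for ch in symbol)


def distribute_by(rows, key, num_reducers=NUM_REDUCERS):
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[toy_hash(key(row)) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key, reverse=False):
    return [sorted(part, key=key, reverse=reverse) for part in reducers]


def symbol(row):
    return row[0]


def price(row):
    return row[1]


# Same column in both steps: behaves like CLUSTER BY symbol.
same_column = sort_by(distribute_by(rows, symbol), symbol)
# Different columns: group by symbol, newest-highest price first in each reducer.
by_price_desc = sort_by(distribute_by(rows, symbol), price, reverse=True)
print(by_price_desc[1])  # [('AAPL', 152.0), ('AAPL', 150.0)]
```

Splitting the steps is what gives the extra flexibility: the routing key and the sort key no longer have to match.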

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
