How to delete duplicate data from the Hive table? - Big Data In Real World

How to delete duplicate data from the Hive table?

Hadoop In Real World is now Big Data In Real World!
February 25, 2023
How to rename files or objects in Amazon S3?
March 9, 2023
Hadoop In Real World is now Big Data In Real World!
February 25, 2023
How to rename files or objects in Amazon S3?
March 9, 2023

This is a common problem if you are bringing data from a legacy system or simply from a system which you don’t have control over. First identify duplicates and make sure those are the records you would like to delete and then delete the records.

Identify duplicates

This is a standard way of identifying the duplicates in a table. Make sure to group by on all columns which identify uniqueness. This query groups the data by the listed column and counts the number of records. Finally use the having clause to select records with count greater than 1.

select [all columns], count(*) from sales
group by [all columns] 
having count(*) > 1;

Remove duplicates

Here we are doing a simple SELECT with a distinct and an INSERT. Note that when we are selecting and inserting the records from and to the same table.

insert overwrite table sales
select distinct * from sales;
Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

Comments are closed.

How to delete duplicate data from the Hive table?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X