Hadoop In Real World is now Big Data In Real World!
February 25, 2023How to rename files or objects in Amazon S3?
March 9, 2023This is a common problem if you are bringing data from a legacy system or simply from a system which you don’t have control over. First identify duplicates and make sure those are the records you would like to delete and then delete the records.
Identify duplicates
This is a standard way of identifying the duplicates in a table. Make sure to group by on all columns which identify uniqueness. This query groups the data by the listed column and counts the number of records. Finally use the having clause to select records with count greater than 1.
select [all columns], count(*) from sales group by [all columns] having count(*) > 1;
Remove duplicates
Here we are doing a simple SELECT with a distinct and an INSERT. Note that when we are selecting and inserting the records from and to the same table.
insert overwrite table sales select distinct * from sales;