It is a pretty common use case to find the list of duplicate elements or rows in a Spark DataFrame, and it is very easy to do with a groupBy() and a count().
Solution
The DataFrame below has a single column, which is named value by default. We first groupBy that column; groupBy followed by count() adds a second column listing the number of times each value appears.
Once you have the count column, filter on it to keep only the records with a count greater than 1.
In our sample data, 20 is repeated 2 times and 30 is repeated 3 times.
scala> import spark.implicits._
import spark.implicits._

scala> val data1 = Seq(10, 20, 20, 30, 30, 30, 40)
data1: Seq[Int] = List(10, 20, 20, 30, 30, 30, 40)

scala> val df1 = data1.toDF()
df1: org.apache.spark.sql.DataFrame = [value: int]

scala> val dupes = df1.groupBy("value").count.filter("count > 1")
dupes: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int, count: bigint]

scala> dupes.show
+-----+-----+
|value|count|
+-----+-----+
|   20|    2|
|   30|    3|
+-----+-----+
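The same pattern extends to DataFrames with more than one column: group by every column so a row only counts as a duplicate when it matches on all of them. Here is a minimal sketch, assuming the same spark-shell session; the two-column DataFrame and its name/city columns are made up for illustration.

import org.apache.spark.sql.functions.col

// Hypothetical two-column DataFrame; the third row duplicates the first
val df2 = Seq(("alice", "NYC"), ("bob", "LA"), ("alice", "NYC")).toDF("name", "city")

// Group by all columns so a row is a duplicate only when every column matches,
// then keep the groups that occur more than once
val dupeRows = df2.groupBy(df2.columns.map(col): _*).count.filter("count > 1")

dupeRows.show()  // shows the (alice, NYC) row with a count of 2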