It is a pretty common use case to find the list of duplicate elements or rows in a Spark DataFrame, and it is very easy to do with a groupBy() and a count().
Solution
The DataFrame below has a single column, which is named value by default. We first groupBy that column; groupBy followed by count() adds a second column listing the number of times each value appears.
Once you have the count column, filter on it to keep only the records with a count greater than 1.
In our sample data, 20 is repeated 2 times and 30 is repeated 3 times.
scala> import spark.implicits._
import spark.implicits._

scala> val data1 = Seq(10, 20, 20, 30, 30, 30, 40)
data1: Seq[Int] = List(10, 20, 20, 30, 30, 30, 40)

scala> val df1 = data1.toDF()
df1: org.apache.spark.sql.DataFrame = [value: int]

scala> val dupes = df1.groupBy("value").count.filter("count > 1")
dupes: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int, count: bigint]

scala> dupes.show
+-----+-----+
|value|count|
+-----+-----+
|   20|    2|
|   30|    3|
+-----+-----+
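The same pattern extends to DataFrames with more than one column: group by every column so a row only counts as a duplicate when it matches on all of them. Here is a minimal sketch, assuming the same spark-shell session; the two-column DataFrame and its name/city columns are made up for illustration.

import org.apache.spark.sql.functions.col

// Hypothetical two-column DataFrame; the third row duplicates the first
val df2 = Seq(("alice", "NYC"), ("bob", "LA"), ("alice", "NYC")).toDF("name", "city")

// Group by all columns so a row is a duplicate only when every column matches,
// then keep the groups that occur more than once
val dupeRows = df2.groupBy(df2.columns.map(col): _*).count.filter("count > 1")

dupeRows.show()  // shows the (alice, NYC) row with a count of 2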