Pretty simple. Use except() to subtract or find the difference between two DataFrames.
Solution
except() returns the rows that are in DataFrame 1 but not in DataFrame 2. It behaves like SQL's EXCEPT DISTINCT, so duplicates are dropped as well: even if a value appears multiple times in DataFrame 1 and only once in DataFrame 2, none of its occurrences show up in the result.
import spark.implicits._

scala> val data1 = Seq(10, 20, 20, 30, 40)
data1: Seq[Int] = List(10, 20, 20, 30, 40)

scala> val data2 = Seq(20, 30)
data2: Seq[Int] = List(20, 30)

scala> val df1 = data1.toDF()
df1: org.apache.spark.sql.DataFrame = [value: int]

scala> val df2 = data2.toDF()
df2: org.apache.spark.sql.DataFrame = [value: int]

scala> df1.except(df2).show
+-----+
|value|
+-----+
|   40|
|   10|
+-----+

scala> df2.except(df1).show
+-----+
|value|
+-----+
+-----+
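If you want to keep duplicates rather than get distinct rows, the Dataset API also offers exceptAll() (available in Spark 2.4 and later). Below is a minimal sketch continuing with the same df1 and df2 as above; the exact row order in the output may differ from run to run.

scala> df1.exceptAll(df2).show
// Unlike except(), exceptAll() subtracts occurrences one for one,
// so one of the two 20s in df1 survives in the result.
// Expected rows (order may vary): 40, 10, 20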