What is the difference between hadoop fs put and copyFromLocal?
February 24, 2021How to skip the first line or header when reading a file in Hive?
March 1, 2021Broadcast Nested Loop join works by broadcasting one of the entire datasets and performing a nested loop to join the data. So essentially every record from dataset 1 is attempted to join with every record from dataset 2.
As you could guess, Broadcast Nested Loop is not preferred and could be quite slow. It works for both equi and non-equi joins and it is picked by default when you have a non-equi join.
Click here if you like to understand the internal workings of Broadcast Nested Loop join.
Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>
You can find the type of join algorithm that Spark is using by calling queryExecution.executedPlan on the joined DataFrame. If you see BroadcastNestedLoopJoin in the plan then you can be sure that BroadcastNestedLoopJoin is being used.
scala> dfJoined.queryExecution.executedPlan res2: org.apache.spark.sql.execution.SparkPlan = BroadcastNestedLoopJoin BuildRight, Inner, (id1#3 >= id2#8)
Example & Solution
Below results in a Broadcast Nested Loop join.
df1.join(broadcast(df2), $"id1" === $"id2" || $"id2" === $"id3", "left")
This can be changed into a simple Shuffle Hash Join by breaking the single join with the OR condition to two separate joins and union both the output.
val part1 = df1.join(broadcast(df2), $"id1" === $"id2" ", "left") val part2 = df1.join(broadcast(df2), $"id2" === $"id3", "left") val resultDF = part1.unionByName(part2)
Broadcast Hash Join in Spark works by broadcasting the small dataset to all the executors and once the data is broadcasted a standard hash join is performed in all the executors. Broadcast Hash Join happens in 2 phases.
Broadcast phase – small dataset is broadcasted to all executors
Hash Join phase – small dataset is hashed in all the executors and joined with the partitioned big dataset.
Broadcast Hash Join doesn’t involve a sort operation. Click here to learn more about the internals of Broadcast Hash Join.
Furthermore, the last union statement also does not involve a trigger so you can except the review join statements to perform better than the first one.
1 Comment
[…] How to avoid a Broadcast Nested Loop join in Spark? […]