How to avoid a Broadcast Nested Loop join in Spark? - Big Data In Real World

How to avoid a Broadcast Nested Loop join in Spark?

What is the difference between hadoop fs put and copyFromLocal?
February 24, 2021
How to skip the first line or header when reading a file in Hive?
March 1, 2021
What is the difference between hadoop fs put and copyFromLocal?
February 24, 2021
How to skip the first line or header when reading a file in Hive?
March 1, 2021

Broadcast Nested Loop join works by broadcasting one of the entire datasets and performing a nested loop to join the data. So essentially every record from dataset 1 is attempted to join with every record from dataset 2.

As you could guess, Broadcast Nested Loop is not preferred and could be quite slow. It works for both equi and non-equi joins and it is picked by default when you have a non-equi join.

Click here if you like to understand the internal workings of Broadcast Nested Loop join.

Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>

You can find the type of join algorithm that Spark is using by calling queryExecution.executedPlan on the joined DataFrame. If you see BroadcastNestedLoopJoin in the plan then you can be sure that BroadcastNestedLoopJoin is being used.

scala> dfJoined.queryExecution.executedPlan

res2: org.apache.spark.sql.execution.SparkPlan =
BroadcastNestedLoopJoin BuildRight, Inner, (id1#3 >= id2#8)

 

Example & Solution

Below results in a Broadcast Nested Loop join.

df1.join(broadcast(df2), $"id1" === $"id2" || $"id2" === $"id3", "left")

This can be changed into a simple Shuffle Hash Join by breaking the single join with the OR condition to two separate joins and union both the output.

val part1 = df1.join(broadcast(df2), $"id1" === $"id2" ", "left") 
val part2 = df1.join(broadcast(df2), $"id2" === $"id3", "left") 
val resultDF = part1.unionByName(part2)

Broadcast Hash Join in Spark works by broadcasting the small dataset to all the executors and once the data is broadcasted a standard hash join is performed in all the executors. Broadcast Hash Join happens in 2 phases.

Broadcast phase – small dataset is broadcasted to all executors

Hash Join phase – small dataset is hashed in all the executors and joined with the partitioned big dataset.

Broadcast Hash Join doesn’t involve a sort operation. Click here to learn more about the internals of Broadcast Hash Join. 

Furthermore, the last union statement also does not involve a trigger so you can except the review join statements to perform better than the first one.

Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

1 Comment

  1. […] How to avoid a Broadcast Nested Loop join in Spark? […]

How to avoid a Broadcast Nested Loop join in Spark?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X