How does Spark choose the join algorithm to use at runtime?

Spark takes several factors into account before deciding which join algorithm to use to join datasets at runtime.

Spark has the following 5 algorithms to choose from –

  1. Broadcast Hash Join
  2. Shuffle Hash Join
  3. Shuffle Sort Merge Join
  4. Broadcast Nested Loop Join
  5. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)
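
You can see which of these algorithms Spark actually picked by inspecting the physical plan. A minimal sketch, assuming a SparkSession named spark (the DataFrame names are made up for illustration):

    // Two hypothetical DataFrames of very different sizes.
    val orders = spark.range(1000000).withColumnRenamed("id", "order_id")
    val users  = spark.range(100).withColumnRenamed("id", "user_id")

    val joined = orders.join(users, orders("order_id") === users("user_id"))

    // The physical plan names the chosen algorithm, e.g. BroadcastHashJoin
    // or SortMergeJoin.
    joined.explain()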

Join condition (equi vs. non-equi joins)

A join on equality (=) is an equi join.

Joins on conditions like >=, <, etc. are non-equi joins.
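
For example, a sketch with two hypothetical DataFrames, trades and quotes:

    // Equi join: the condition is a plain equality on the join keys,
    // so all five algorithms are candidates.
    val equiJoin = trades.join(quotes, trades("symbol") === quotes("symbol"))

    // Non-equi join: the condition is a range comparison, so only broadcast
    // nested loop join and cartesian product join are candidates.
    val nonEquiJoin = trades.join(quotes, trades("price") >= quotes("bid"))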

Equi joins

The join algorithms below are possible for equi joins. In other words, every join algorithm is available when the join condition is an equality.

  1. Broadcast Hash Join
  2. Shuffle Hash Join
  3. Shuffle Sort Merge Join
  4. Broadcast Nested Loop Join
  5. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)

Non-equi joins

Only two join algorithms are possible with non-equi joins.

  1. Broadcast Nested Loop Join
  2. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)

Join types

INNER, OUTER, LEFT SEMI, etc. are join types.
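
In the DataFrame API, the join type is passed as the third argument to join. A quick sketch with hypothetical DataFrames df1 and df2:

    val innerJoin = df1.join(df2, Seq("id"))                 // inner is the default
    val outerJoin = df1.join(df2, Seq("id"), "full_outer")
    val semiJoin  = df1.join(df2, Seq("id"), "left_semi")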

All join types

The join algorithms below support all join types.

  1. Shuffle Hash Join
  2. Shuffle Sort Merge Join
  3. Broadcast Nested Loop Join

All join types except full outer join

Broadcast Hash Join

Only INNER-like joins

Cartesian Product Join 

When hints are specified

With Spark 3.0, we can specify hints to instruct Spark to choose the join algorithm we prefer. Check this post to learn how.

If it is an equi join, Spark gives priority to the hinted join algorithms in the order below (a sketch of the hint syntax follows the list).

  1. Broadcast hint: pick broadcast hash join if the join type is supported. If both sides have broadcast hints, choose the smaller side (based on stats) to broadcast.
  2. Sort merge hint: pick sort merge join if the join keys are sortable.
  3. Shuffle hash hint: pick shuffle hash join if the join type is supported. If both sides have shuffle hash hints, choose the smaller side (based on stats) as the build side.
  4. Shuffle replicate NL hint: pick cartesian product join if the join type is inner-like.
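
A sketch of what these hints look like in practice, with hypothetical DataFrames large and small:

    val bhj = large.join(small.hint("broadcast"), Seq("key"))             // broadcast hash join
    val smj = large.join(small.hint("merge"), Seq("key"))                 // shuffle sort merge join
    val shj = large.join(small.hint("shuffle_hash"), Seq("key"))          // shuffle hash join
    val cpj = large.join(small.hint("shuffle_replicate_nl"), Seq("key"))  // cartesian product join

    // The same hints are available in SQL, assuming large and small are
    // registered as temp views.
    spark.sql("SELECT /*+ BROADCAST(s) */ * FROM large l JOIN small s ON l.key = s.key")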

When hints are not specified

When no hints are specified, Spark gives priority to the join algorithms in the order below. (The configuration settings that drive the first two steps are sketched after the list.)

  1. Pick broadcast hash join if one side is small enough to broadcast and the join type is supported. If both sides are small, choose the smaller side (based on stats) to broadcast.
  2. Pick shuffle hash join if one side is small enough to build a local hash map, is much smaller than the other side, and spark.sql.join.preferSortMergeJoin is false.
  3. Pick sort merge join if the join keys are sortable.
  4. Pick cartesian product join if the join type is inner-like.
  5. Pick broadcast nested loop join as the final fallback. It may result in an OutOfMemoryError, but there is no other choice.
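
The first two steps are driven by configuration. A sketch of the relevant settings (the values shown are the defaults unless noted):

    // Sides smaller than this threshold (in bytes) qualify for broadcast
    // hash join; the default is 10 MB. Set it to -1 to disable broadcasting.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // Defaults to true. Setting it to false lets Spark consider
    // shuffle hash join ahead of sort merge join.
    spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")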

 
