How does Spark choose the join algorithm to use at runtime?
There are several factors Spark takes into account before deciding on the type of join algorithm to use to join datasets at runtime.
Spark has the following 5 algorithms to choose from –
- Broadcast Hash Join
- Shuffle Hash Join
- Shuffle Sort Merge Join
- Broadcast Nested Loop Join
- Cartesian Product Join (a.k.a. Shuffle-and-Replicate Nested Loop Join)
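A quick way to see which of these Spark picked is to inspect the physical plan with explain(). Below is a minimal sketch; the SparkSession setup and the orders/customers DataFrames are made up for illustration and are reused in the sketches that follow.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-algorithms")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two small, made-up DataFrames used throughout the sketches below
val orders    = Seq((1, 100.0), (2, 250.0)).toDF("customer_id", "amount")
val customers = Seq((1, "Alice", 150.0), (2, "Bob", 300.0)).toDF("id", "name", "credit_limit")

// The physical plan names the selected algorithm, e.g. BroadcastHashJoin,
// SortMergeJoin or BroadcastNestedLoopJoin
orders.join(customers, orders("customer_id") === customers("id")).explain()
```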
Join condition (equi vs. non-equi joins)
A join condition using = is an equi join.
Conditions using >=, <, etc. are non-equi joins (see the sketch below).
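In the DataFrame API the difference is simply the operator in the join condition. A minimal sketch, reusing the made-up orders and customers DataFrames from above:

```scala
// Equi join: the condition is an equality comparison on the join keys
val equiJoin = orders.join(customers, orders("customer_id") === customers("id"))

// Non-equi join: the condition is a range comparison
val nonEquiJoin = orders.join(customers, orders("amount") >= customers("credit_limit"))
```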
Equi joins
The join algorithms below are possible for equi joins; essentially, all join algorithms are possible with equi joins.
- Broadcast Hash Join
- Shuffle Hash Join
- Shuffle Sort Merge Join
- Broadcast Nested Loop Join
- Cartesian Product Join (a.k.a. Shuffle-and-Replicate Nested Loop Join)
Non-equi joins
Only 2 join algorithms are possible with non-equi joins (the sketch after this list shows the fallback in the physical plan).
- Broadcast Nested Loop Join
- Cartesian Product Join (a.k.a. Shuffle-and-Replicate Nested Loop Join)
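Continuing the sketch from above: with default settings, the small customers side fits the broadcast threshold, so a non-equi join typically falls back to broadcast nested loop join rather than a cartesian product.

```scala
// Hash- and sort-based joins require equality on the join keys, so with a
// non-equi condition the plan typically shows BroadcastNestedLoopJoin here
orders.join(customers, orders("amount") >= customers("credit_limit")).explain()
```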
Join types
INNER, OUTER, LEFT SEMI, etc. are the join types.
All join types
The join algorithms below support all join types.
- Shuffle Hash Join
- Shuffle Sort Merge Join
- Broadcast Nested Loop Join
All join types except full outer join
- Broadcast Hash Join
Only INNER-like joins
- Cartesian Product Join
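One consequence worth verifying: even when one side is small enough to broadcast, a full outer join cannot use broadcast hash join, so Spark falls back to an algorithm that supports it. Continuing the sketch from above:

```scala
// Inner join on a broadcastable side: plan typically shows BroadcastHashJoin
orders.join(customers, orders("customer_id") === customers("id"), "inner").explain()

// Full outer join on the same data: broadcast hash join is not supported,
// so the plan typically falls back to SortMergeJoin
orders.join(customers, orders("customer_id") === customers("id"), "full_outer").explain()
```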
When hints are specified
With Spark 3.0 we can specify hints to instruct Spark to choose the join algorithm we prefer (a sketch after the list below shows the syntax). Check this post to learn how.
If it is an equi join, Spark will give priority to the join algorithms in the following order.
- broadcast hint: pick broadcast hash join if the join type is supported. If both sides have the broadcast hints, choose the smaller side (based on stats) to broadcast.
- sort merge hint: pick sort merge join if join keys are sortable.
- shuffle hash hint: pick shuffle hash join if the join type is supported. If both sides have the shuffle hash hints, choose the smaller side (based on stats) as the build side.
- shuffle replicate NL hint: pick cartesian product join if the join type is inner-like.
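For reference, the hint names Spark 3.0 accepts in the DataFrame API are broadcast, merge, shuffle_hash and shuffle_replicate_nl. A sketch, again with the made-up DataFrames from above:

```scala
val cond = orders("customer_id") === customers("id")

// Ask Spark to broadcast the customers side (broadcast hash join)
orders.join(customers.hint("broadcast"), cond)

// Prefer shuffle sort merge join
orders.join(customers.hint("merge"), cond)

// Prefer shuffle hash join, with customers as the build side
orders.join(customers.hint("shuffle_hash"), cond)

// Prefer the shuffle-and-replicate nested loop (cartesian product) join
orders.join(customers.hint("shuffle_replicate_nl"), cond)
```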
When hints are not specified
When no hints are specified, Spark will give priority to the join algorithms in the following order.
- Pick broadcast hash join if one side is small enough to broadcast, and the join type is supported. If both sides are small, choose the smaller side (based on stats) to broadcast.
- Pick shuffle hash join if one side is small enough to build a local hash map, is much smaller than the other side, and spark.sql.join.preferSortMergeJoin is false (see the configuration sketch after this list).
- Pick sort merge join if the join keys are sortable.
- Pick cartesian product join if the join type is inner-like.
- Pick broadcast nested loop join as the final solution. It may result in an OutOfMemoryError, but there is no other choice at that point.
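Two configurations drive most of this selection: spark.sql.autoBroadcastJoinThreshold (10 MB by default) decides what counts as small enough to broadcast, and spark.sql.join.preferSortMergeJoin (true by default) gates the shuffle hash join step. A sketch of how you might adjust them:

```scala
// Setting the broadcast threshold to -1 disables broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// preferSortMergeJoin is an internal flag; setting it to false lets Spark
// consider shuffle hash join when the size conditions above are met
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```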