Over time we have written several posts on Spark joins, explaining how each join algorithm works internally. All of those posts are collected on this page. Bookmark it and refer back to it as needed.
Workings of different join algorithms in Spark
The posts below explain each join algorithm in detail, walking through its internal implementation in Spark with an example. They also cover when to use each join algorithm and when not to.
- How does Broadcast Hash Join work in Spark?
- How does Shuffle Hash Join work in Spark?
- How does Shuffle Sort Merge Join work in Spark?
- How does Broadcast Nested Loop Join work in Spark?
- How does Cartesian Product Join work in Spark?
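As a rough orientation before reading the posts above, here is a minimal plain-Python sketch of the three core ideas these algorithms build on: build-and-probe (the basis of the two hash joins), sort-then-merge (the basis of sort merge join), and pairwise comparison (the basis of nested loop and cartesian product joins). It runs on in-memory lists of dicts and deliberately ignores broadcasting, shuffling, and partitioning, so it is an illustration of the concepts and not Spark's implementation. The function names and sample data are made up for this example.

```python
from collections import defaultdict


def hash_join(small, large, key):
    """Inner equi-join: build a hash table on the smaller side, probe with the larger."""
    table = defaultdict(list)
    for row in small:  # build phase
        table[row[key]].append(row)
    out = []
    for row in large:  # probe phase
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out


def sort_merge_join(left, right, key):
    """Inner equi-join: sort both sides on the key, then merge with two cursors."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def nested_loop_join(left, right, condition):
    """Compare every pair of rows; works for any join condition, not only equality."""
    return [{**l, **r} for l in left for r in right if condition(l, r)]


# Hypothetical sample data.
customers = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
orders = [
    {"id": 1, "item": "a"},
    {"id": 2, "item": "b"},
    {"id": 2, "item": "c"},
    {"id": 4, "item": "d"},
]

print(hash_join(customers, orders, "id"))
print(sort_merge_join(customers, orders, "id"))
print(nested_loop_join(customers, orders, lambda l, r: l["id"] == r["id"]))
```

All three calls produce the same matched rows here. A cartesian product corresponds to the nested loop with a condition that is always true, which is why it gets expensive quickly as the inputs grow.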
Join prioritization
In the post below, we summarize different scenarios and the join algorithm that suits each one. We also discuss how Spark prioritizes one join algorithm over another.
How does Spark choose the join algorithm to use at runtime?
Spark 3.0
Spark 3.0 improved how we can instruct Spark to use a particular join algorithm over another by introducing new join hints. The post below covers this in detail.
How to specify join hints with Spark 3.0?
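To give a feel for what a join hint looks like, the sketch below builds the text of a Spark SQL query with a hint comment in it. The four hint names are the join strategy hints Spark 3.0 accepts. The helper `hinted_join_sql` and the table and column names are hypothetical, and the code only produces a string, so it runs without a Spark installation.

```python
# Join strategy hints accepted by Spark 3.0 SQL.
JOIN_HINTS = ("BROADCAST", "SHUFFLE_MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL")


def hinted_join_sql(hint, hinted_table, left, right, key):
    """Hypothetical helper: build a SELECT that asks Spark to apply `hint` to `hinted_table`."""
    name = hint.upper()
    if name not in JOIN_HINTS:
        raise ValueError(f"unknown join hint: {hint}")
    return (
        f"SELECT /*+ {name}({hinted_table}) */ * "
        f"FROM {left} JOIN {right} ON {left}.{key} = {right}.{key}"
    )


# Hypothetical tables: ask for a sort merge join with a hint on `orders`.
query = hinted_join_sql("shuffle_merge", "orders", "orders", "customers", "customer_id")
print(query)
```

With a live `SparkSession`, the resulting string could be passed to `spark.sql(query)`. The DataFrame API offers the same thing through `DataFrame.hint`, for example `orders_df.join(customers_df.hint("broadcast"), "customer_id")`, where the DataFrame names are again assumptions. A hint is a request and not a guarantee: Spark can fall back to another algorithm when the hinted one does not support the join type.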