With Spark 3.0, we can specify the type of join algorithm we would like Spark to use at runtime.
Broadcast Hash Join
Check this post to understand how Broadcast Hash Join works.
The join side with the hint will be broadcast regardless of the size limit specified in the spark.sql.autoBroadcastJoinThreshold property. If both sides of the join have broadcast hints, the side with the smaller estimated size (based on statistics) will be broadcast.
There are 3 variations of this hint.
SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ BROADCASTJOIN(t1) */ * FROM t1 LEFT JOIN t2 ON t1.key = t2.key;
SELECT /*+ MAPJOIN(t2) */ * FROM t1 RIGHT JOIN t2 ON t1.key = t2.key;
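Even without a hint, Spark broadcasts a side automatically when its estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default). As a sketch, you can tune or disable this behavior at the session level:

```sql
-- Raise the automatic broadcast threshold to 50 MB (default is 10485760 bytes, i.e. 10 MB)
SET spark.sql.autoBroadcastJoinThreshold = 52428800;

-- Set to -1 to disable automatic broadcast joins; only explicitly hinted joins will broadcast
SET spark.sql.autoBroadcastJoinThreshold = -1;
```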
Shuffle Sort Merge Join
Check this post to understand how Shuffle Sort Merge Join works. There are 3 variations of this hint.
SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
Shuffle Hash Join
Check this post to understand how Shuffle Hash Join works. If both sides have the shuffle hash hint, Spark chooses the smaller side (based on stats) as the build side.
SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
Shuffle-and-Replicate Nested Loop Join (a.k.a. Cartesian Product Join)
Check this post to understand how Shuffle-and-Replicate Nested Loop Join works.
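Unlike the hints above, this one comes in a single form, SHUFFLE_REPLICATE_NL. A minimal example, using the same t1 and t2 tables as in the earlier snippets:

```sql
SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
```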
What happens if more than one hint is specified?
When different join strategy hints are specified on both sides of a join, Spark prioritizes the hints in the following order:
- BROADCAST
- MERGE
- SHUFFLE_HASH
- SHUFFLE_REPLICATE_NL
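For example, if one side of the join carries a BROADCAST hint and the other a MERGE hint, the BROADCAST hint wins because it sits higher in the priority list. A sketch, again assuming the t1 and t2 tables from above:

```sql
-- BROADCAST outranks MERGE, so Spark broadcasts t1 and ignores the MERGE hint on t2
SELECT /*+ BROADCAST(t1), MERGE(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
```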