How to specify join hints with Spark 3.0? - Big Data In Real World

How to specify join hints with Spark 3.0?

Why does Hadoop need classes like Text instead of String?
February 3, 2021
How to make an Amazon S3 bucket read-only?
February 8, 2021
Why does Hadoop need classes like Text instead of String?
February 3, 2021
How to make an Amazon S3 bucket read-only?
February 8, 2021

With Spark 3.0, we can specify the type of join algorithm we would like Spark to use at runtime.

Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>

Broadcast Hash Join

Check this post to understand how Broad Hash Join works.

The join side with the hint will be broadcast regardless of the size limit specified in spark.sql.autoBroadcastJoinThreshold  property. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) will be broadcast.

There are 3 variations of this hint.

SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;

SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;

 

Shuffle Sort Merge Join

Check this post to understand how Shuffle Sort Merge Join works. There are 3 variations of this hint.

SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

 

Shuffle Hash Join

Check this post to understand how Shuffle Hash Join works. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats).

SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

 

Shuffle-and-Replicate Nested Loop Join (a.k.a Cartiesian product Join)

Check this post to understand how Shuffle-and-Replicate Nested Loop Join works.

 

What happens if more than one hint is specified?

When different join strategy hints are specified on both sides of a join, Spark prioritizes the join algorithms or hint in the following order

  1. BROADCAST
  2. MERGE
  3. SHUFFLE_HASH
  4. SHUFFLE_REPLICATE_NL 

 

Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

4 Comments

  1. […] post How to specify join hints with Spark 3.0? appeared first on Hadoop In Real […]

  2. […] post How to specify join hints with Spark 3.0? appeared first on Hadoop In Real […]

  3. […] Spark 3.0 we can specify the hints to instruct Spark to choose the join algorithm we prefer. Check this post to learn […]

  4. […] Spark 3.0 we can specify the hints to instruct Spark to choose the join algorithm we prefer. Check this post to learn […]

How to specify join hints with Spark 3.0?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X