Why does Cartesian Product Join aka Shuffle-and-Replication Nested Loop Join not cause a shuffle? - Big Data In Real World



Shuffle-and-Replication does not mean a “true” shuffle, in which records with the same key are sent to the same partition. Instead, each partition of one dataset is sent whole, or replicated, to all the partitions of the other dataset for a full cross or nested-loop join.
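To make the distinction concrete, here is a minimal plain-Python sketch. It is a model of the idea only, not Spark's implementation: the function names `hash_shuffle` and `replicate` and the sample data are hypothetical, and partitions are represented as simple lists of `(key, value)` tuples.

```python
from collections import defaultdict


def hash_shuffle(partitions, num_out):
    """Key-based ("true") shuffle: every record is routed on its own by hash(key)."""
    out = defaultdict(list)
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out].append((key, value))
    return [out[i] for i in range(num_out)]


def replicate(left_partitions, right_partitions):
    """Replication: each left partition is paired, as a whole, with each right partition."""
    return [(lp, rp) for lp in left_partitions for rp in right_partitions]


# Integer keys are used so that hash(key) is the same on every run.
left = [[(1, "a"), (2, "b")], [(1, "c")]]
right = [[(1, "x")], [(3, "y")]]

shuffled = hash_shuffle(left, 2)   # records regrouped by key
pairs = replicate(left, right)     # partitions kept intact, only paired up
```

In the shuffled output, both records with key `1` end up in the same partition even though they started in different ones. In the replicated output, the original partitions are untouched and there is simply one pair for every combination of a left and a right partition.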


This is what the execution of a Cartesian Product Join, aka Shuffle-and-Replication Nested Loop Join, looks like.
[Figure: Cartesian product join Spark stages]

Reason

Internally, partitions from one DataFrame are sent over the network to the partitions of the other DataFrame. Many partitions are therefore moved across the network as whole units, but no shuffle is involved, because records are never redistributed by key.

Once partitions from both datasets are available on one side, a nested loop join is performed.

If there are N records in one dataset and M records in the other, the nested loop runs over N * M record pairs.
[Figure: Cartesian product join Stage 1]
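The N * M count can be checked with a small plain-Python sketch. Again, this only models the behaviour described above and is not how Spark implements the join; `cartesian_join` and the sample partitions are hypothetical names chosen for illustration.

```python
def cartesian_join(left_partitions, right_partitions):
    """Nested loop join over every (left partition, right partition) pair."""
    rows = []
    for lp in left_partitions:          # each left partition...
        for rp in right_partitions:     # ...meets each right partition as a whole
            for l in lp:                # nested loop inside the pair
                for r in rp:
                    rows.append((l, r))
    return rows


left = [["l1", "l2"], ["l3"]]              # N = 3 records in 2 partitions
right = [["r1", "r2"], ["r3", "r4"]]       # M = 4 records in 2 partitions

rows = cartesian_join(left, right)
n = sum(len(p) for p in left)
m = sum(len(p) for p in right)
```

Every left record is combined with every right record exactly once, so `len(rows)` equals `n * m` regardless of how the records are split across partitions.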

 
