Why does a Cartesian Product Join, aka Shuffle-and-Replication Nested Loop Join, not cause a shuffle?

Published by Big Data In Real World on March 5, 2021

Reason

Internally, partitions from one DataFrame are sent over the network to the partitions of the other DataFrame. A lot of data still moves over the network, but each partition is moved as a whole rather than having its records redistributed, so a shuffle is not involved.
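The difference between rerouting individual records and moving whole partitions can be sketched in plain Python. This is only an illustration, not Spark code: it assumes that a shuffle means re-bucketing every record by its key, and the function names `shuffle_by_key` and `pair_whole_partitions` are hypothetical.

```python
from itertools import product


def shuffle_by_key(partitions, num_buckets):
    """Simulate a shuffle: each record is rerouted to a bucket chosen by its key."""
    buckets = [[] for _ in range(num_buckets)]
    for partition in partitions:
        for key, value in partition:
            # Records from the same input partition can land in different buckets.
            buckets[key % num_buckets].append((key, value))
    return buckets


def pair_whole_partitions(left_partitions, right_partitions):
    """Simulate the cartesian approach: partitions are paired as they are, never split."""
    return list(product(left_partitions, right_partitions))


# Two toy datasets, each already split into two partitions of (key, value) records.
left = [[(1, "a"), (2, "b")], [(3, "c")]]
right = [[(1, "x")], [(2, "y"), (4, "z")]]

shuffled = shuffle_by_key(left, 2)            # input partitions are broken apart
pairs = pair_whole_partitions(left, right)    # input partitions stay intact
```

In the shuffled result the original partition boundaries are gone, while every entry in `pairs` still holds two untouched input partitions.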

Once partitions from both datasets are available on one side, a nested loop join is performed.

If there are N records in one dataset and M records in the other, the nested loop runs over N * M record pairs.
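As a rough illustration of the N * M cost, here is a minimal plain-Python sketch of a nested loop over paired partitions. It is not Spark's implementation; the helper names `nested_loop_join` and `cartesian_product` and the toy data are assumptions made for the example.

```python
def nested_loop_join(left_records, right_records):
    """Pair every record on the left with every record on the right."""
    return [(l, r) for l in left_records for r in right_records]


def cartesian_product(left_partitions, right_partitions):
    """Run the nested loop once for each pair of whole partitions."""
    rows = []
    for left_part in left_partitions:
        for right_part in right_partitions:
            rows.extend(nested_loop_join(left_part, right_part))
    return rows


left_partitions = [[1, 2], [3]]                  # N = 3 records
right_partitions = [["a"], ["b", "c"], ["d"]]    # M = 4 records

rows = cartesian_product(left_partitions, right_partitions)
print(len(rows))  # 12, which is N * M
```

The output size depends only on N and M, not on how the records are split into partitions.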

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming big data platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
