Shuffle-and-Replication does not mean a “true” shuffle, in which records with the same key are sent to the same partition. Instead, each entire partition of one dataset is sent over, or replicated, to all the partitions of the other dataset for a full cross or nested-loop join.
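To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: lists stand in for partitions, and the function names `hash_shuffle` and `replicate_partitions` are made up for this illustration. The first function places records by key, the way a key-based shuffle would; the second ignores keys and pairs whole partitions.

```python
# Toy model only: plain Python lists stand in for Spark partitions.
# hash_shuffle and replicate_partitions are hypothetical names, not Spark APIs.

def hash_shuffle(records, num_partitions):
    """Key-based shuffle: a record's key alone decides its target partition."""
    targets = [[] for _ in range(num_partitions)]
    for key, value in records:
        targets[hash(key) % num_partitions].append((key, value))
    return targets


def replicate_partitions(left_partitions, right_partitions):
    """Replication: every left partition is paired, whole, with every right partition."""
    return [(left, right) for left in left_partitions for right in right_partitions]


records = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
shuffled = hash_shuffle(records, num_partitions=2)
# Both records with key 1 end up in the same target partition.

left_partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
right_partitions = [[(7, "x")], [(8, "y")], [(9, "z")]]
pairs = replicate_partitions(left_partitions, right_partitions)
print(len(pairs))  # 2 left partitions x 3 right partitions = 6 pairings
```

In the replicated case the keys play no part in placement, which is why the number of partition pairings grows with the product of the two partition counts.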
This is what the execution of a Cartesian Product Join, also known as a Shuffle-and-Replication Nested Loop Join, looks like.
Reason
Internally, partitions from one DataFrame are sent over the network to the partitions of the other dataset. This means many partitions are moved across the network as a whole, and no key-based shuffle is involved.
Once partitions from both datasets are available on one side, a nested loop join is performed.
If there are N records in one dataset and M records in the other, the nested loop evaluates N * M record pairs.
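The N * M cost can be checked with a small sketch in plain Python. This is only an illustration under simplifying assumptions: lists stand in for partitions, nothing is actually sent over a network, and `nested_loop_join` is a hypothetical helper, not a Spark function. It pairs every left partition with every right partition, loops over the record pairs, and counts how many pairs it evaluates.

```python
# Toy model only: nested_loop_join is a hypothetical helper, not a Spark API.

def nested_loop_join(left_partitions, right_partitions, condition=lambda left_row, right_row: True):
    """Pair every left partition with every right partition, then loop over record pairs."""
    comparisons = 0
    output = []
    for left in left_partitions:            # each left partition ...
        for right in right_partitions:      # ... meets each right partition as a whole
            for left_row in left:
                for right_row in right:
                    comparisons += 1        # one evaluation per record pair
                    if condition(left_row, right_row):
                        output.append((left_row, right_row))
    return output, comparisons


left_partitions = [[1, 2], [3]]             # N = 3 records
right_partitions = [[10, 20], [30, 40]]     # M = 4 records

# Cartesian product: no condition, so every pair is emitted.
rows, comparisons = nested_loop_join(left_partitions, right_partitions)

# A non-equi condition reduces the output, but not the number of pairs evaluated.
filtered, filtered_comparisons = nested_loop_join(
    left_partitions, right_partitions, lambda left_row, right_row: left_row * 10 < right_row
)
print(len(rows), comparisons, len(filtered), filtered_comparisons)
```

Adding a join condition shrinks the result but leaves the work unchanged: the loop still visits all 3 * 4 pairs, which is what makes this join expensive on large inputs.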