Spark – Page 3 – Big Data In Real World

March 12, 2021

Published by Big Data In Real World at March 12, 2021

Categories

Spark

Definitive guide on Spark join algorithms

Over time, we have written several posts on Spark joins and join algorithms, explaining the internal workings of these algorithms. Here are all the posts […]

March 5, 2021

Published by Big Data In Real World at March 5, 2021

Categories

Spark

Why does Cartesian Product Join, aka Shuffle-and-Replication Nested Loop Join, not cause a shuffle?

Shuffle-and-Replication does not mean a “true” shuffle, in which records with the same keys are sent to the same partition. Instead, the entire partition of the […]

February 26, 2021

Published by Big Data In Real World at February 26, 2021

Categories

Spark

How to avoid a Broadcast Nested Loop join in Spark?

Broadcast Nested Loop join works by broadcasting one of the datasets in its entirety and performing a nested loop to join the data. So essentially, every record from […]

February 19, 2021

Published by Big Data In Real World at February 19, 2021

Categories

Spark

How to force Spark to use Shuffle Hash Join when it defaults to Sort Merge Join?

Take a look at the execution plan below. Currently, when you print the executed plan, you see that Spark is using Sort Merge Join. scala> dfJoined.queryExecution.executedPlan […]

February 12, 2021

Published by Big Data In Real World at February 12, 2021

Categories

Spark

How does Spark choose the join algorithm to use at runtime?

There are several factors Spark takes into account before deciding on the type of join algorithm to use to join datasets at runtime. Spark has the […]

February 10, 2021

Published by Big Data In Real World at February 10, 2021

Categories

Spark

How to convert RDD to DataFrame and Dataset in Spark?

Let’s create an RDD first. Below we are creating an RDD with some sample data. scala> val data = Seq( | (1, "Shirt", 20, […]

February 5, 2021

Published by Big Data In Real World at February 5, 2021

Categories

Spark

How to specify join hints with Spark 3.0?

With Spark 3.0, we can specify the type of join algorithm we would like Spark to use at runtime. […]

January 29, 2021

Published by Big Data In Real World at January 29, 2021

Categories

Spark

How does Cartesian Product Join work in Spark?

Cartesian Product Join (a.k.a. Shuffle-and-Replication Nested Loop join) works very similarly to a Broadcast Nested Loop join, except the dataset is not broadcast. Shuffle-and-Replication does not […]

January 25, 2021

Published by Big Data In Real World at January 25, 2021

Categories

Spark

How to control log settings in Spark and stop INFO messages?

Spark spits out a lot of INFO messages in the logs. These messages stand in the way when you are troubleshooting an error and looking for […]

January 22, 2021

Published by Big Data In Real World at January 22, 2021

Categories

Spark

How does Shuffle Sort Merge Join work in Spark?

Shuffle Sort Merge Join, as the name indicates, involves a sort operation. Shuffle Sort Merge Join has 3 phases. Shuffle Phase – both datasets are shuffled […]

January 15, 2021

Published by Big Data In Real World at January 15, 2021

Categories

Spark

How does Broadcast Hash Join work in Spark?

Broadcast Hash Join in Spark works by broadcasting the small dataset to all the executors; once the data is broadcast, a standard hash join is […]

January 13, 2021

Published by Big Data In Real World at January 13, 2021

Categories

How to merge multiple output files from MapReduce or Spark jobs to one?

When you run a MapReduce or a Spark job, the number of output files will equal the number of reducers involved in the MapReduce job or […]

January 8, 2021

Published by Big Data In Real World at January 8, 2021

Categories

Spark

How does Broadcast Nested Loop Join work in Spark?

Broadcast Nested Loop join works by broadcasting one of the datasets in its entirety and performing a nested loop to join the data. So essentially, every record from […]

January 1, 2021

Published by Big Data In Real World at January 1, 2021

Categories

Spark

How does Shuffle Hash Join work in Spark?

Shuffle Hash Join, as the name indicates, works by shuffling both datasets, so the same keys from both sides end up in the same partition or […]

December 23, 2020

Published by Big Data In Real World at December 23, 2020

Categories

Spark

How to convert List to a JavaRDD in Spark?

This is a very common use case if you are working on a Spark project and writing code in Java. You have a list and now […]

December 21, 2020

Published by Big Data In Real World at December 21, 2020

Categories

Spark

What is the difference between repartition and coalesce in Spark?

The repartition and coalesce functions in Spark change the number of partitions of a DataFrame. Repartition: increases or decreases partitions. Repartition always […]

December 9, 2020

Published by Big Data In Real World at December 9, 2020

Categories

Hadoop
Spark

How to convert RDD to DataFrame in Spark?

You have an RDD in your code and now you want to work with the data using DataFrames in Spark. Spark provides you with functions to convert RDD to […]

December 7, 2020

Published by Big Data In Real World at December 7, 2020

Categories

Spark

How to show full column content in a Spark DataFrame?

Most often, when we are working with data in Spark, we want to preview the data or the solution in the Spark shell right […]

May 18, 2020

Published by Big Data In Real World at May 18, 2020

Categories

Spark

How To Catch Malware Using Spark

Those in the industry know that one of the best techniques for catching malware using Machine Learning (ML) is only possible with distributed computing. Before showing […]

March 25, 2019

Published by Big Data In Real World at March 25, 2019

Categories

Improving Performance In Spark Using Partitions

In this blog post we are going to show how to optimize your Spark job by partitioning the data correctly. To demonstrate this we are going […]