foreach() and foreachPartition() are actions, not transformations. Since both are actions, neither function returns an RDD. Would you like us […]
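As a quick illustration of the difference, here is a minimal spark-shell sketch (sc is the shell's predefined SparkContext; the RDD and its contents are made up for the example):

```scala
// foreach: the supplied function runs once per element on the executors;
// being an action, it returns Unit rather than a new RDD.
val numbersRDD = sc.parallelize(1 to 10, 2)
numbersRDD.foreach(n => println(s"element: $n"))

// foreachPartition: the function runs once per partition and receives an
// iterator over that partition's elements -- handy for per-partition setup
// such as opening a database connection.
numbersRDD.foreachPartition(iter => iter.foreach(n => println(s"partition element: $n")))
```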
Accumulators are like global variables in a Spark application. In the real world, accumulators are used as counters to keep track of something at an […]
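For example, here is a minimal spark-shell sketch of a counter-style accumulator (the data and the accumulator name are illustrative):

```scala
// Built-in long accumulator used to count unparseable records.
val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("1", "2", "oops", "4", "not-a-number"))
val parsed = lines.flatMap { s =>
  val value = scala.util.Try(s.toInt).toOption
  if (value.isEmpty) badRecords.add(1)   // executors update the accumulator
  value.toSeq
}

parsed.count()              // an action must run before the updates are merged
println(badRecords.value)   // 2
```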
java.net.BindException is a common exception when Spark tries to initialize the SparkContext, and it comes up especially often when you run Spark locally. 16/01/04 […]
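One common local workaround, shown here as an assumption rather than the fix described in the post, is to pin the driver to the loopback address, either by exporting SPARK_LOCAL_IP=127.0.0.1 before launching or by setting spark.driver.bindAddress (available since Spark 2.1) when building the session:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: binding the driver to 127.0.0.1 avoids hostname-resolution
// issues that often surface as java.net.BindException in local mode.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bind-address-demo")
  .config("spark.driver.bindAddress", "127.0.0.1")
  .getOrCreate()
```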
Broadcast variables are variables that are available on all executors running the Spark application. These variables are cached and ready to be used by tasks […]
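A minimal spark-shell sketch of a broadcast lookup table (the map and the RDD are illustrative):

```scala
// Shipped once to each executor and cached there.
val countryNames = Map("US" -> "United States", "IN" -> "India", "DE" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

val codesRDD = sc.parallelize(Seq("US", "DE", "US", "IN"))

// Tasks read the cached broadcast value instead of carrying the whole map
// inside every serialized task closure.
val resolved = codesRDD.map(code => broadcastNames.value.getOrElse(code, "unknown"))
resolved.collect().foreach(println)
```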
We get this question a lot, so we thought we would write a short post to answer it. Spark leverages Hadoop's InputFormat to read files […]
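As a small sketch (the file path is illustrative), textFile goes through a Hadoop input format under the hood, and its optional second argument suggests a minimum number of partitions:

```scala
// Read a text file and ask Spark for at least 4 partitions.
val lines = sc.textFile("data/sample.txt", minPartitions = 4)
println(lines.getNumPartitions)
```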
Both map and mapPartitions are narrow transformations; neither function triggers a shuffle. Let’s say our RDD has 5 partitions and 10 elements in each […]
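A minimal spark-shell sketch of the difference, using the 5-partition layout mentioned above:

```scala
// 5 partitions with 10 elements each.
val numbersRDD = sc.parallelize(1 to 50, 5)

// map: the function is called once per element (50 times here).
val doubled = numbersRDD.map(_ * 2)

// mapPartitions: the function is called once per partition (5 times here)
// and receives an iterator over that partition's elements.
val doubledByPartition = numbersRDD.mapPartitions(iter => iter.map(_ * 2))

println(doubled.count())             // 50
println(doubledByPartition.count())  // 50
```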
reduceByKey() has the following properties: the result of the combination (e.g. a sum) is of the same type as the values, and the operation when combined […]
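For instance, in this spark-shell sketch the combine function takes two Int values and returns an Int, so the result type matches the value type (the data is illustrative):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)))

// _ + _ is applied within each partition and again when merging the
// partial results from different partitions.
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)   // (a,3) and (b,2)
```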
Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference between reduceByKey and groupByKey is that reduceByKey does […]
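A small spark-shell word-count sketch contrasting the two (the words are illustrative):

```scala
val pairs = sc.parallelize(Seq("spark", "hadoop", "spark", "hive", "spark")).map(w => (w, 1))

// reduceByKey combines values inside each partition before the shuffle
// (map-side combine), so less data crosses the network.
val countsReduce = pairs.reduceByKey(_ + _)

// groupByKey shuffles every (key, value) pair and aggregates afterwards.
val countsGroup = pairs.groupByKey().mapValues(_.sum)

countsReduce.collect().foreach(println)
countsGroup.collect().foreach(println)
```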
The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame, or Dataset to […]
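A minimal spark-shell sketch of both calls (the RDD is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val numbersRDD = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist() with the default storage level.
val squares = numbersRDD.map(n => n.toLong * n).cache()

// persist() lets you choose a storage level explicitly.
val evens = numbersRDD.filter(_ % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK)

squares.count()   // the first action materializes and stores the data
evens.count()
```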
Here are three high-level reasons why using the cache() or persist() functions would be appropriate and helpful. Performing multiple actions on the same RDD: numbersRDD is used […]
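To illustrate the first reason, here is a spark-shell sketch in which numbersRDD is hit by several actions (the pipeline itself is made up):

```scala
// Without cache(), the lineage would be recomputed from the source for
// every one of the actions below.
val numbersRDD = sc.parallelize(1 to 1000000).map(_ * 2).cache()

println(numbersRDD.count())   // the first action computes and caches
println(numbersRDD.sum())     // subsequent actions read from the cache
println(numbersRDD.max())
```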