Here are three high-level reasons why using the cache() or persist() functions would be appropriate and helpful.
Performing multiple actions on the same RDD
numbersRDD is used in two count() actions below. Because nothing is cached, each action recomputes the RDD from scratch, so numbersRDD is read from the file twice.
val numbersRDD = sc.textFile("/user/hirw/numbers.txt")
val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()
Calling cache() on numbersRDD avoids the duplicate read: the first count() materializes the RDD in memory, and the second count() reuses the cached partitions.
val numbersRDD = sc.textFile("/user/hirw/numbers.txt")
numbersRDD.cache()
val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()
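Note that cache() is shorthand for persist() with the default MEMORY_ONLY storage level; persist() lets you pick a different level when the data may not fit in memory. A minimal sketch, assuming the Spark shell (sc in scope) and hypothetical isPositive/isNegative helpers like those in the example above:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper predicates, matching the example above.
def isPositive(number: String): Boolean = number.trim.toInt > 0
def isNegative(number: String): Boolean = number.trim.toInt < 0

val numbersRDD = sc.textFile("/user/hirw/numbers.txt")

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
// instead of dropping them, so neither action triggers a full re-read.
numbersRDD.persist(StorageLevel.MEMORY_AND_DISK)

val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()

// Release the cached partitions once both actions have run.
numbersRDD.unpersist()
```

Calling unpersist() when you are done frees executor memory for other cached datasets.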
Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>
Iterative transformations
Iterative transformations are common when you are working with data interactively in the Spark shell, or otherwise analyzing your data by calling the same transformation functions or actions on the same RDD over and over. In that case it is helpful to cache() the base RDD or DataFrame that you find yourself building on repeatedly.
Iterative computations are also common in machine learning use cases.
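Iterative algorithms such as gradient descent scan the same training data on every pass, so caching the input pays off after the first iteration. A sketch, assuming the Spark shell (sc in scope) and a hypothetical points.txt with one numeric value per line:

```scala
// Hypothetical update step: move the estimate halfway toward
// correcting the current mean error.
def updated(estimate: Double, meanError: Double): Double =
  estimate + 0.5 * meanError

val pointsRDD = sc.textFile("/user/hirw/points.txt").map(_.trim.toDouble)
pointsRDD.cache() // the first action materializes it; later passes reuse it

var estimate = 0.0
for (_ <- 1 to 10) {
  // Each pass runs an action on the same RDD; without cache() every
  // pass would re-read and re-parse the text file.
  val meanError = pointsRDD.map(p => p - estimate).mean()
  estimate = updated(estimate, meanError)
}
```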
Debug memory or other data issues
cache() or persist() come in handy when you are troubleshooting memory or other data issues. Use cache() or persist() on data which you know is good and doesn't require recomputation. This saves you a lot of time during a troubleshooting exercise.
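For example, if a long pipeline fails in a later stage, you can cache the last intermediate result you have already verified; each debugging run of the downstream transformations then starts from the cached data instead of recomputing the whole lineage. A sketch, assuming the Spark shell and a hypothetical parseRecord step standing in for any expensive, already-verified transformation chain:

```scala
// Hypothetical parse step; in a real pipeline this could be any
// expensive chain of transformations you have already checked.
def parseRecord(line: String): Array[String] = line.split(",")

val rawRDD = sc.textFile("/user/hirw/events.txt")
val parsedRDD = rawRDD.map(parseRecord)

// You have inspected parsedRDD (e.g. with take(10)) and trust it,
// so cache it before experimenting with the suspect downstream logic.
parsedRDD.cache()
parsedRDD.count() // materialize the cache with one action

// Each debugging attempt below now reads from the cache, not the file.
val suspect = parsedRDD.filter(fields => fields.length > 2)
suspect.take(10)
```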