When to use cache and persist functions in Spark? - Big Data In Real World

When to use cache and persist functions in Spark?


Here are three high-level situations in which using cache() or persist() is appropriate and helpful.

Performing multiple actions on the same RDD

numbersRDD is used in two count() actions. As a result, the code below reads numbersRDD from the file twice.

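// isPositive and isNegative are assumed to be user-defined String => Boolean helpers
val numbersRDD = sc.textFile("/user/hirw/numbers.txt")
// Each count() is an action, so each one reads and parses the file again
val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()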

Calling cache() on numbersRDD avoids the duplicate read: the second action reuses the cached data instead of reading the file again.

val numbersRDD = sc.textFile("/user/hirw/numbers.txt")
// cache() is lazy: the first action below populates the cache
numbersRDD.cache()
val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()
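
The post mentions persist() but does not show it. On an RDD, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while persist() lets you choose the storage level yourself. The following is a minimal sketch of that, not the author's code: the in-memory sample data, the helper definitions, and the local[*] master are all assumptions made so the example can run on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  // Hypothetical helpers; the post does not show how isPositive/isNegative are defined
  val isPositive: String => Boolean = number => number.trim.toInt > 0
  val isNegative: String => Boolean = number => number.trim.toInt < 0

  def countSigns(spark: SparkSession): (Long, Long) = {
    // In-memory stand-in (assumption) for sc.textFile("/user/hirw/numbers.txt")
    val numbersRDD = spark.sparkContext.parallelize(Seq("5", "-3", "8", "-1", "0"))

    // MEMORY_AND_DISK keeps partitions in memory and spills the rest to disk
    numbersRDD.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action populates the persisted data, the second reuses it
    val positiveCount = numbersRDD.filter(isPositive).count()
    val negativeCount = numbersRDD.filter(isNegative).count()

    // Release the storage once the RDD is no longer needed
    numbersRDD.unpersist()
    (positiveCount, negativeCount)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder().appName("persist-example").master("local[*]").getOrCreate()
    try println(countSigns(spark)) finally spark.stop()
  }
}
```

MEMORY_AND_DISK is a reasonable choice when the data may not fit in memory, since partitions that would otherwise be dropped and recomputed are written to disk instead.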


Iterative transformations

Iterative transformations are common when you work with data interactively, in the Spark shell or by other means, and analyze it by calling the same transformations or actions on the same RDD over and over. In that case it helps to cache() the base RDD or DataFrame that you would otherwise rebuild each time.

Iterative computations are also common in machine learning use cases, where an algorithm makes repeated passes over the same training data.
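
As a rough illustration of the iterative case, the sketch below runs a small gradient-descent loop in which every iteration triggers an action on the same cached RDD. It is an assumption-laden toy, not a real machine learning job: the sample points, the step size, the name fitMean, and the local[*] master are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object IterativeCacheExample {
  // Estimates the mean of a small data set by gradient descent on squared error
  def fitMean(spark: SparkSession, iterations: Int): Double = {
    // Base RDD reused by every iteration, so it is cached once up front
    val points = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).cache()
    val n = points.count() // first action materializes the cache

    var w = 0.0
    val stepSize = 0.5 // hypothetical step size chosen for the toy data
    for (_ <- 1 to iterations) {
      val current = w // copy to a val so the closure captures a fixed value
      // Each pass is a separate job over the cached points
      val gradient = points.map(x => current - x).sum() / n
      w = current - stepSize * gradient
    }

    points.unpersist()
    w
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder().appName("iterative-cache-example").master("local[*]").getOrCreate()
    try println(fitMean(spark, 20)) finally spark.stop()
  }
}
```

Without the cache() call, each iteration would recompute points from its source, which is where the cost adds up when the source is a large file rather than a small in-memory sequence.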

Debug memory or other data issues

cache() or persist() comes in handy when you are troubleshooting memory or other data issues. Use cache() or persist() on data that you believe is good and doesn’t require recomputation. This saves you a lot of time during a troubleshooting exercise.
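
One way this can look in practice is sketched below: an intermediate RDD that is believed to be correct is persisted, so that repeated runs of the step under investigation do not recompute it. The data, the stage names, and the local[*] master are hypothetical. getStorageLevel and toDebugString are used only to confirm that the RDD is marked as persisted and to inspect the lineage.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object DebugPersistExample {
  def run(spark: SparkSession): (Boolean, Long) = {
    // Hypothetical raw input containing one malformed record
    val raw = spark.sparkContext.parallelize(Seq("1", "2", "oops", "4"))

    // Stage believed to be good: parse integers and drop lines that do not parse
    val parsed = raw
      .flatMap(line => Try(line.trim.toInt).toOption)
      .persist(StorageLevel.MEMORY_AND_DISK)
    parsed.count() // materialize once so later runs reuse the stored partitions

    // Stage under investigation: can be re-run many times without re-parsing
    val suspect = parsed.map(_ * 2).filter(_ > 4)
    val suspectCount = suspect.count()

    // Confirm that the known-good stage is marked as persisted
    val isPersisted = parsed.getStorageLevel != StorageLevel.NONE
    // Print the lineage of the suspect stage for inspection
    println(suspect.toDebugString)

    parsed.unpersist()
    (isPersisted, suspectCount)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder().appName("debug-persist-example").master("local[*]").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```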

 

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming big data platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
