Here are three high-level reasons why using the cache() or persist() functions would be appropriate and helpful.
Performing multiple actions on the same RDD
numbersRDD is used in two count() actions below. Because nothing is cached, each action recomputes the RDD from scratch, so numbersRDD is read from the file twice.
val numbersRDD = sc.textFile("/user/hirw/numbers.txt")
val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()
Calling cache() on numbersRDD avoids the duplicate read: the first count() materializes the RDD in memory, and the second count() reuses the cached partitions.
val numbersRDD = sc.textFile("/user/hirw/numbers.txt")
numbersRDD.cache()
val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()
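Note that cache() is shorthand for persist() with the default MEMORY_ONLY storage level; persist() lets you pick a different level when the data may not fit in memory. A minimal sketch, assuming the Spark shell (sc in scope) and hypothetical isPositive/isNegative helpers like those in the example above:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper predicates, matching the example above.
def isPositive(number: String): Boolean = number.trim.toInt > 0
def isNegative(number: String): Boolean = number.trim.toInt < 0

val numbersRDD = sc.textFile("/user/hirw/numbers.txt")

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
// instead of dropping them, so neither action triggers a full re-read.
numbersRDD.persist(StorageLevel.MEMORY_AND_DISK)

val positiveCount = numbersRDD.filter(number => isPositive(number)).count()
val negativeCount = numbersRDD.filter(number => isNegative(number)).count()

// Release the cached partitions once both actions have run.
numbersRDD.unpersist()
```

Calling unpersist() when you are done frees executor memory for other cached datasets.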
Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>
Iterative transformations
Iterative transformations are common when you are working with data interactively in the Spark shell, or otherwise analyzing your data by calling the same transformation functions or actions on the same RDD over and over. In that case it is helpful to cache() the base RDD or DataFrame that you find yourself building on repeatedly.
Iterative computations are also common in machine learning use cases.
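Iterative algorithms such as gradient descent scan the same training data on every pass, so caching the input pays off after the first iteration. A sketch, assuming the Spark shell (sc in scope) and a hypothetical points.txt with one numeric value per line:

```scala
// Hypothetical update step: move the estimate halfway toward
// correcting the current mean error.
def updated(estimate: Double, meanError: Double): Double =
  estimate + 0.5 * meanError

val pointsRDD = sc.textFile("/user/hirw/points.txt").map(_.trim.toDouble)
pointsRDD.cache() // the first action materializes it; later passes reuse it

var estimate = 0.0
for (_ <- 1 to 10) {
  // Each pass runs an action on the same RDD; without cache() every
  // pass would re-read and re-parse the text file.
  val meanError = pointsRDD.map(p => p - estimate).mean()
  estimate = updated(estimate, meanError)
}
```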
Debug memory or other data issues
cache() or persist() come in handy when you are troubleshooting memory or other data issues. Use cache() or persist() on data which you know is good and doesn't require recomputation. This saves you a lot of time during a troubleshooting exercise.
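For example, if a long pipeline fails in a later stage, you can cache the last intermediate result you have already verified; each debugging run of the downstream transformations then starts from the cached data instead of recomputing the whole lineage. A sketch, assuming the Spark shell and a hypothetical parseRecord step standing in for any expensive, already-verified transformation chain:

```scala
// Hypothetical parse step; in a real pipeline this could be any
// expensive chain of transformations you have already checked.
def parseRecord(line: String): Array[String] = line.split(",")

val rawRDD = sc.textFile("/user/hirw/events.txt")
val parsedRDD = rawRDD.map(parseRecord)

// You have inspected parsedRDD (e.g. with take(10)) and trust it,
// so cache it before experimenting with the suspect downstream logic.
parsedRDD.cache()
parsedRDD.count() // materialize the cache with one action

// Each debugging attempt below now reads from the cache, not the file.
val suspect = parsedRDD.filter(fields => fields.length > 2)
suspect.take(10)
```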