
What is the difference between cache and persist in Spark?


The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame or Dataset. You can mark an RDD, DataFrame or Dataset to be persisted by calling persist() or cache() on it. Nothing is stored right away: the first time the data is computed in an action, the objects behind the RDD, DataFrame or Dataset will be kept in memory, or at the configured storage level, on the nodes.
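As a minimal sketch (assuming a SparkSession named `spark` is already available), note that marking data for caching does nothing by itself; the cache is only populated by the first action:

```scala
// Minimal sketch, assuming a SparkSession named `spark` is available.
val rdd = spark.sparkContext
  .parallelize(1 to 1000000)
  .map(_ * 2)      // a transformation we don't want to recompute

rdd.cache()        // only marks the RDD for caching; nothing is stored yet
rdd.count()        // first action: computes the RDD and populates the cache
rdd.sum()          // later actions read the cached partitions instead of recomputing
```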


Before we look at the differences, let's look at the storage levels first.

Check out this post if you are interested in knowing when to use the cache or persist functions.

cache()

cache() doesn't take any parameters.

cache() on an RDD persists the objects in memory (the MEMORY_ONLY storage level).

RDD 

rdd.cache()

cache() on a DataFrame or Dataset persists the objects at the MEMORY_AND_DISK storage level (check the storage levels below).

DataFrame

df.cache()

Dataset

ds.cache()
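The different defaults can be checked with getStorageLevel on an RDD and storageLevel on a DataFrame or Dataset. A minimal sketch, assuming a SparkSession named `spark` with `spark.implicits._` imported:

```scala
// Sketch: the default cache level differs between RDDs and DataFrames/Datasets.
// Assumes a SparkSession named `spark` is available.
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
rdd.cache()
println(rdd.getStorageLevel)   // MEMORY_ONLY for RDDs

val df = Seq(1, 2, 3).toDF("n")
df.cache()
println(df.storageLevel)       // MEMORY_AND_DISK for DataFrames and Datasets
```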

persist()

There are two flavours of the persist() function:

persist() – without an argument. When called without an argument, it uses the default storage level and is equivalent to calling cache().

RDD

rdd.persist()

DataFrame

df.persist()

Dataset

ds.persist()

persist(StorageLevel) – with StorageLevel as argument

RDD

rdd.persist(StorageLevel.MEMORY_ONLY_SER)

DataFrame

df.persist(StorageLevel.DISK_ONLY)

Dataset

ds.persist(StorageLevel.MEMORY_AND_DISK)

Based on the provided StorageLevel, the behaviour of the persisted objects will vary.
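As a hypothetical sketch (assuming a SparkSession named `spark` with `spark.implicits._` imported), the level that was set can be verified after the fact:

```scala
// Sketch: persist at an explicit level, then verify it.
// Assumes a SparkSession named `spark` is available.
import org.apache.spark.storage.StorageLevel
import spark.implicits._

val ds = Seq("a", "b", "c").toDS()
ds.persist(StorageLevel.DISK_ONLY)
println(ds.storageLevel)   // reports the level set above
ds.count()                 // action that materializes the persisted partitions
```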

Storage levels

A storage level defines three things:

  1. whether the persisted objects should be stored serialized or deserialized
  2. whether the objects should be kept in memory or on disk
  3. whether the persisted objects should be replicated or not
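Internally, the named levels below are just combinations of these flags. As a sketch, the StorageLevel factory takes them directly (useDisk, useMemory, useOffHeap, deserialized, replication):

```scala
// Sketch: building storage levels from the three choices above (plus off-heap).
import org.apache.spark.storage.StorageLevel

// StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
val memoryOnly = StorageLevel(false, true, false, true, 1)  // same flags as StorageLevel.MEMORY_ONLY
val diskSer2   = StorageLevel(true, true, false, false, 2)  // serialized, spills to disk, 2 replicas
```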

MEMORY_ONLY

Store RDD, DataFrame or Dataset as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached.

MEMORY_AND_DISK

Store RDD, DataFrame or Dataset as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.

MEMORY_ONLY_SER (Java and Scala)

Store RDD, DataFrame or Dataset as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER (Java and Scala)

Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.

DISK_ONLY

Store the RDD, DataFrame or Dataset partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental)

Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
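Off-heap storage is not on by default. As a sketch, it is typically enabled through configuration when the session is built (the app name and size here are illustrative):

```scala
// Sketch: enabling off-heap memory via configuration; values are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-demo")                         // illustrative app name
  .config("spark.memory.offHeap.enabled", "true")  // allow off-heap storage
  .config("spark.memory.offHeap.size", "1g")       // must be set to a positive value
  .getOrCreate()
```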

Which storage level to use?

MEMORY_ONLY is the most CPU-efficient option and is optimal for performance.

Use MEMORY_ONLY_SER if the RDD, DataFrame or Dataset has a lot of elements, together with an efficient serialization framework like Kryo. This option is space efficient.

Spilling data to disk is expensive, so use DISK_ONLY and options like MEMORY_AND_DISK only when the computation needed to produce the RDD, DataFrame or Dataset being persisted is itself expensive.

Use replication options like MEMORY_ONLY_2 only if you have a mandate for fast, full recovery.
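Whichever level you pick, cached data holds executor memory or disk until it is released. A minimal sketch, assuming a SparkSession named `spark`:

```scala
// Sketch: release cached partitions once they are no longer needed.
// Assumes a SparkSession named `spark` is available.
import org.apache.spark.storage.StorageLevel

val df = spark.range(1000000).toDF("id")
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()       // materializes the cache
// ... reuse df across several actions ...
df.unpersist()   // frees the cached partitions
```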

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.
