Sometimes you might see a stage being skipped in the DAG visualization in the Spark web UI. In this post we discuss a couple of reasons why a stage might be skipped during the execution of a Spark job.
Cached data
If the data is cached or persisted through an explicit call to cache() or persist(), a stage is skipped when its result is already available in the cache.
Shuffle data
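Spark automatically keeps the shuffle output of a stage, that is, the intermediate files the map side writes to local disk, even without an explicit cache() or persist(). A shuffle is an expensive operation, so when a later job depends on the same shuffle, Spark reuses those files and skips the stage that produced them. Note that this data is not available forever: shuffle files are removed once the RDDs that reference them are no longer used and are garbage collected. Least Recently Used (LRU) eviction under memory pressure applies to data held through cache() or persist(), not to shuffle files.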