foreach() and foreachPartition() are actions, not transformations. Since both are actions, neither function returns an RDD. Would you like us […]
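As a quick illustration of the difference, here is a minimal spark-shell sketch (sc is the shell's predefined SparkContext; the RDD and its contents are made up for the example):

```scala
// foreach: the supplied function runs once per element on the executors;
// being an action, it returns Unit rather than a new RDD.
val numbersRDD = sc.parallelize(1 to 10, 2)
numbersRDD.foreach(n => println(s"element: $n"))

// foreachPartition: the function runs once per partition and receives an
// iterator over that partition's elements -- handy for per-partition setup
// such as opening a database connection.
numbersRDD.foreachPartition(iter => iter.foreach(n => println(s"partition element: $n")))
```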
Accumulators are like global variables in a Spark application. In the real world, accumulators are used as counters to keep track of something at an […]
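For example, here is a minimal spark-shell sketch of a counter-style accumulator (the data and the accumulator name are illustrative):

```scala
// Built-in long accumulator used to count unparseable records.
val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("1", "2", "oops", "4", "not-a-number"))
val parsed = lines.flatMap { s =>
  val value = scala.util.Try(s.toInt).toOption
  if (value.isEmpty) badRecords.add(1)   // executors update the accumulator
  value.toSeq
}

parsed.count()              // an action must run before the updates are merged
println(badRecords.value)   // 2
```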
java.net.BindException is a common exception when Spark tries to initialize the SparkContext, and it comes up especially often when you run Spark locally. 16/01/04 […]
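One common local workaround, shown here as an assumption rather than the fix described in the post, is to pin the driver to the loopback address, either by exporting SPARK_LOCAL_IP=127.0.0.1 before launching or by setting spark.driver.bindAddress (available since Spark 2.1) when building the session:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: binding the driver to 127.0.0.1 avoids hostname-resolution
// issues that often surface as java.net.BindException in local mode.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bind-address-demo")
  .config("spark.driver.bindAddress", "127.0.0.1")
  .getOrCreate()
```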
Broadcast variables are variables that are available on all executors running the Spark application. These variables are cached and ready to be used by tasks […]
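A minimal spark-shell sketch of a broadcast lookup table (the map and the RDD are illustrative):

```scala
// Shipped once to each executor and cached there.
val countryNames = Map("US" -> "United States", "IN" -> "India", "DE" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

val codesRDD = sc.parallelize(Seq("US", "DE", "US", "IN"))

// Tasks read the cached broadcast value instead of carrying the whole map
// inside every serialized task closure.
val resolved = codesRDD.map(code => broadcastNames.value.getOrElse(code, "unknown"))
resolved.collect().foreach(println)
```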
We get this question a lot, so we thought we would write a short post to answer it. Spark leverages Hadoop's InputFormat to read files […]
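As a small sketch (the file path is illustrative), textFile goes through a Hadoop input format under the hood, and its optional second argument suggests a minimum number of partitions:

```scala
// Read a text file and ask Spark for at least 4 partitions.
val lines = sc.textFile("data/sample.txt", minPartitions = 4)
println(lines.getNumPartitions)
```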
Both map and mapPartitions are narrow transformations; neither function triggers a shuffle. Let’s say our RDD has 5 partitions and 10 elements in each […]
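A minimal spark-shell sketch of the difference, using the 5-partition layout mentioned above:

```scala
// 5 partitions with 10 elements each.
val numbersRDD = sc.parallelize(1 to 50, 5)

// map: the function is called once per element (50 times here).
val doubled = numbersRDD.map(_ * 2)

// mapPartitions: the function is called once per partition (5 times here)
// and receives an iterator over that partition's elements.
val doubledByPartition = numbersRDD.mapPartitions(iter => iter.map(_ * 2))

println(doubled.count())             // 50
println(doubledByPartition.count())  // 50
```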
reduceByKey() has the following properties: the result of the combination (e.g. a sum) is of the same type as the values, and the operation when combined […]
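For instance, in this spark-shell sketch the combine function takes two Int values and returns an Int, so the result type matches the value type (the data is illustrative):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)))

// _ + _ is applied within each partition and again when merging the
// partial results from different partitions.
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)   // (a,3) and (b,2)
```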
Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference between reduceByKey and groupByKey is that reduceByKey does […]
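A small spark-shell word-count sketch contrasting the two (the words are illustrative):

```scala
val pairs = sc.parallelize(Seq("spark", "hadoop", "spark", "hive", "spark")).map(w => (w, 1))

// reduceByKey combines values inside each partition before the shuffle
// (map-side combine), so less data crosses the network.
val countsReduce = pairs.reduceByKey(_ + _)

// groupByKey shuffles every (key, value) pair and aggregates afterwards.
val countsGroup = pairs.groupByKey().mapValues(_.sum)

countsReduce.collect().foreach(println)
countsGroup.collect().foreach(println)
```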
The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame, or Dataset to […]
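A minimal spark-shell sketch of both calls (the RDD is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val numbersRDD = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist() with the default storage level.
val squares = numbersRDD.map(n => n.toLong * n).cache()

// persist() lets you choose a storage level explicitly.
val evens = numbersRDD.filter(_ % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK)

squares.count()   // the first action materializes and stores the data
evens.count()
```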
Here are three high-level reasons why using the cache() or persist() functions would be appropriate and helpful. Performing multiple actions on the same RDD: numbersRDD is used […]
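To illustrate the first reason, here is a spark-shell sketch in which numbersRDD is hit by several actions (the pipeline itself is made up):

```scala
// Without cache(), the lineage would be recomputed from the source for
// every one of the actions below.
val numbersRDD = sc.parallelize(1 to 1000000).map(_ * 2).cache()

println(numbersRDD.count())   // the first action computes and caches
println(numbersRDD.sum())     // subsequent actions read from the cache
println(numbersRDD.max())
```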