
What is the difference between map and mapPartitions in Spark?


Both map and mapPartitions are narrow transformations; neither triggers a shuffle.

Let’s say our RDD has 5 partitions with 10 elements in each partition, for a total of 50 elements. At execution time, each partition is processed by a task, and each task runs on a worker node.
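The layout described above can be sketched in plain Scala, no Spark cluster needed (in real Spark you would get the same shape with something like sc.parallelize(1 to 50, 5)):

```scala
// 50 elements split into 5 "partitions" of 10 elements each,
// mirroring the sample RDD described in the text.
val groups: List[List[Int]] = (1 to 50).toList.grouped(10).toList

println(groups.length)      // number of partitions: 5
println(groups.head.length) // elements per partition: 10
```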


map()

map is a transformation function that is applied to every element of the RDD and returns a new RDD with the transformed elements.

In our sample RDD, the map function will be called on each element. Since the RDD has 50 elements, applying map will invoke the function 50 times.
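To see the per-element call count, here is a minimal plain-Scala simulation of the same 5-partition layout (a sketch only; Spark's RDD.map has the same once-per-element semantics):

```scala
// Simulate 5 partitions of 10 elements each.
val parts: List[List[Int]] = (1 to 50).toList.grouped(10).toList

var mapCalls = 0
// map-style transformation: the supplied function runs once per element.
val doubled = parts.map(_.map { e => mapCalls += 1; e * 2 })

println(mapCalls) // 50 - once for every element in the RDD
```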

mapPartitions()

mapPartitions is a transformation function that is applied once per partition of the RDD. Since our sample RDD has 5 partitions, mapPartitions will be called 5 times.
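The per-partition call count can be simulated the same way (again a sketch in plain Scala; in real Spark the function would receive an Iterator over the partition's elements):

```scala
// Simulate 5 partitions of 10 elements each.
val chunks: List[List[Int]] = (1 to 50).toList.grouped(10).toList

var partitionCalls = 0
// mapPartitions-style transformation: the supplied function runs once
// per partition and processes the whole partition internally.
val transformed = chunks.map { chunk =>
  partitionCalls += 1
  chunk.map(_ * 2)
}

println(partitionCalls) // 5 - once for every partition
```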

When to use map() and when to use mapPartitions()?

Use the mapPartitions function when you need to perform heavy initialization before transforming the elements of the RDD.

Let’s say you need a database connection to transform the elements in your RDD. It doesn’t make sense to initialize a database connection for every element; if we did, we would end up initializing the connection 50 times, which is far from ideal.

Ideally, we want to initialize the database connection once per partition/task. mapPartitions() is the right place to do this, since it is applied once per partition.

Here is a code snippet that gives you an idea of how this can be implemented.

val rddTransformed = rdd.mapPartitions(partition => {

  // DB connection initialized once per partition - 5 times in total
  val connection = new DBConnection()

  // map() applied to each element - 50 times in total.
  // toList forces the iterator here, so every element is processed
  // before the connection is closed below.
  val partitionTransformed = partition.map { element =>
    doSomethingWithDBConnection(element, connection)
  }.toList

  // Executed once per partition - 5 times in total
  connection.close()
  partitionTransformed.iterator // return an iterator over the transformed partition
})

 

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

