What is the difference between UDF, UDAF and UDTF in Hive?
April 30, 2021How to perform range filtering like greater than, less than in Elasticsearch?
May 5, 2021Both map and mapPartitions are narrow transformation functions. Both functions don’t trigger a shuffle.
Let’s say our RDD has 5 partitions and 10 elements in each partition. So a total of 50 elements in total. At execution each partition will be processed by a task. Each task gets executed on worker node.
Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>
map()
map function is a transformation function and gets applied on every element of the RDD and returns a new RDD with transformed elements.
In our sample RDD, the map function will be called on each element in the RDD. So when we apply the map function in our sample RDD it will be called 50 times.
mapPartitions()
mapPartitions is a transformation function and gets applied once per partition in the RDD. In our sample RDD, mapPartitions will be called once per partition so it will be called 5 times because we have 5 partitions in our RDD.
When to use map() and when to use mapPartitions()?
Use mapPartitions function when you need to perform heavy initialization before you transform the elements in the RDD.
Let’s say you need a database connection to transformation elements in your RDD. It doesn’t make sense to initialize database connection for every element in RDD. If we do that, we will end up with initializing database connection 50 times. Which is not ideal.
Ideally we want to initialize database connection once per partition/task. So mapPartitions() is the right place to do database initialization as mapPartitions is applied once per partition.
Here is a code snipped which gives you an idea of how this can be implemented.
val rddTransformed = rdd.mapPartitions(partition => { /*DB init per partition - 5 times*/ val connection = new DBConnection() /*map() applied on each element - 50 times*/ val partitionTransformed = partition.map( element => { dosomethingWithDBConnection(element, connection) }).toList /*Below code calls once per partition - 5 times*/ connection.close() // close dbconnection here partitionTransformed.iterator // returns iterator })
1 Comment
[…] If you are trying to transform the RDD, refer the post related to map and mapPartitions. […]