What is the difference between map and mapValues functions in Spark? - Big Data In Real World

What is the difference between map and mapValues functions in Spark?


In this post we will look at the differences between the map and mapValues functions on Spark RDDs and when to use each.

We have a small made-up dataset of days of the week and temperatures in Fahrenheit. Let’s use both map() and mapValues() to convert the temperatures to Celsius.

map()

Both map() and mapValues() are transformations: they return a new RDD and are evaluated lazily, only when an action such as collect is called.

With map(), we have access to both the key and the value (x._1 and x._2), so we can transform either or both. For example, we could change the key, the day, to all uppercase.

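map() returns an RDD[(String, Double)] here; calling collect on it brings the results back to the driver as an Array[(String, Double)].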

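// Day of the week paired with the temperature in Fahrenheit
val rdd = sc.parallelize(Seq(("Sunday", 50), ("Monday", 60), ("Tuesday", 65), ("Wednesday", 70), ("Thursday", 85), ("Friday", 25), ("Saturday", 15)))

rdd.map { x =>
  // 0.55 is a rough stand-in for 5/9; use 5.0 / 9.0 for an exact conversion
  val ctemp = (x._2 - 32) * 0.55
  (x._1, ctemp)
}.collect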

res1: Array[(String, Double)] = Array((Sunday,9.9), (Monday,15.400000000000002), (Tuesday,18.150000000000002), (Wednesday,20.900000000000002), (Thursday,29.150000000000002), (Friday,-3.8500000000000005), (Saturday,-9.350000000000001))
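
The point about changing the key can be sketched as follows. This is a minimal illustration, not part of the original example: the names temps and upper are made up, the local-mode SparkContext setup is an assumption so the snippet can run outside spark-shell, and the same 0.55 approximation is kept for consistency with the results above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context inside spark-shell; otherwise starts a local one (assumed setup)
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("map-key-sketch").setMaster("local[2]"))

// Hypothetical sample data in the same shape as the post's dataset
val temps = sc.parallelize(Seq(("Sunday", 50), ("Monday", 60), ("Tuesday", 65)))

// map() sees the whole (key, value) pair, so both sides can be rewritten at once
val upper = temps.map { case (day, fahrenheit) =>
  (day.toUpperCase, (fahrenheit - 32) * 0.55)
}

upper.collect.foreach(println)
```

Pattern matching with case (day, fahrenheit) is equivalent to using x._1 and x._2 and is often easier to read.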

mapValues()

Like map(), mapValues() is a transformation, but it is available only on RDDs of key-value pairs.

With mapValues(), unlike map(), we do not have access to the key, only to the value. This means we can transform the value but not the key.

Just like map(), mapValues() returns an RDD[(String, Double)] here, which collect turns into an Array[(String, Double)].

mapValues() differs from map() when the RDD has a partitioner. If we have partitioned our RDD (e.g. using partitionBy), map() “forgets” that partitioner, because the keys might have changed, and the resulting RDD has no partitioner. mapValues(), however, preserves the partitioner set on the RDD: it has no access to the keys, so they cannot change. A preserved partitioner lets later key-based operations, such as reduceByKey or join, avoid an unnecessary shuffle.

rdd.mapValues { x =>
  // x is only the Fahrenheit value; the key (day) passes through unchanged
  (x - 32) * 0.55
}.collect

res4: Array[(String, Double)] = Array((Sunday,9.9), (Monday,15.400000000000002), (Tuesday,18.150000000000002), (Wednesday,20.900000000000002), (Thursday,29.150000000000002), (Friday,-3.8500000000000005), (Saturday,-9.350000000000001))
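
The partitioner behaviour described above can be checked directly through the partitioner field of each RDD. The sketch below is illustrative only: the names temps, partitioned, viaMap and viaMapValues, the choice of a HashPartitioner with 4 partitions, and the local-mode SparkContext setup are all assumptions made for the example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the existing context inside spark-shell; otherwise starts a local one (assumed setup)
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partitioner-sketch").setMaster("local[2]"))

// Hypothetical sample data in the same shape as the post's dataset
val temps = sc.parallelize(Seq(("Sunday", 50), ("Monday", 60), ("Tuesday", 65)))

// Give the pair RDD an explicit partitioner
val partitioned = temps.partitionBy(new HashPartitioner(4))

// Same conversion written both ways; the key is left untouched in each case
val viaMap = partitioned.map { case (day, f) => (day, (f - 32) * 0.55) }
val viaMapValues = partitioned.mapValues(f => (f - 32) * 0.55)

println(partitioned.partitioner.isDefined)   // true: set by partitionBy
println(viaMap.partitioner.isDefined)        // false: map() drops the partitioner
println(viaMapValues.partitioner.isDefined)  // true: mapValues() keeps it
```

Note that map() drops the partitioner even though the keys are left unchanged here: Spark cannot tell that the function preserved them, so it assumes they may have changed.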
