What is the difference between map and flatMap functions in Spark?

Both map and flatMap are transformation functions. When applied to an RDD, they transform each element inside the RDD into something else.

Consider this simple RDD.

 scala> val rdd = sc.parallelize(Seq("Hadoop In Real World", "Big Data"))
 rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27 

The above rdd has 2 elements of type String.

map()

Let’s look at map() first. map() transforms an RDD with N elements into an RDD with N elements. The important thing to note is that each element is transformed into exactly one other element, thereby the resultant RDD has the same number of elements as before.
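
As a quick illustration of the one-element-in, one-element-out behaviour, here is a simple sketch using a different function; the res number and output are what you would expect to see in spark-shell and are shown for illustration only.

 scala> rdd.map(_.length).collect
 res2: Array[Int] = Array(20, 8)

Two Strings go in, two Ints come out; the count never changes with map().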

The split() function, applied to each element of this RDD, breaks the line into words wherever it sees a space between the words.

 rdd.map(_.split(" ")).collect
 res0: Array[Array[String]] = Array(Array(Hadoop, In, Real, World), Array(Big, Data)) 

The result RDD also has 2 elements, but now the type of each element is an Array and not a String as in the initial RDD. Each Array holds the words from the initial String.

When we do a count, we get 2, which is the same as the count of the initial RDD before the split() transformation.

 scala> rdd.map(_.split(" ")).count
 res5: Long = 2 

flatMap()

flatMap() transforms an RDD with N elements into an RDD with potentially more than N elements. Let’s try the same split() operation with flatMap().

 rdd.flatMap(_.split(" ")).collect
 res1: Array[String] = Array(Hadoop, In, Real, World, Big, Data) 

Here, with flatMap(), we can see that the result RDD is not Array[Array[String]] as it was with map(); it is Array[String]. flatMap() took care of “flattening” each Array[String] into plain Strings.
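
Conceptually, flatMap(f) behaves like map(f) followed by a flattening step. A rough way to see this (a sketch; the output is what you would expect in spark-shell) is to flatten the Array[Array[String]] from the map() example ourselves:

 scala> rdd.map(_.split(" ")).flatMap(_.toSeq).collect
 res7: Array[String] = Array(Hadoop, In, Real, World, Big, Data)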

When we do a count() on the result RDD, we get 6, which is more than the number of elements in the initial RDD.

 scala> rdd.flatMap(_.split(" ")).count
 res6: Long = 6 
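
This is why flatMap() is the natural fit for tokenizing text, for example in a word count. Below is a minimal word count sketch built on the same rdd; reduceByKey is standard Spark, but the res number and the ordering of the output pairs are illustrative.

 scala> rdd.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect
 res8: Array[(String, Int)] = Array((Hadoop,1), (In,1), (Real,1), (World,1), (Big,1), (Data,1))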