Both map() and flatMap() are transformation functions. When applied to an RDD, they transform each element of the RDD; the difference lies in how many elements come out for each element that goes in.
Consider this simple RDD.
scala> val rdd = sc.parallelize(Seq("Hadoop In Real World", "Big Data"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
The above rdd has 2 elements of type String.
map()
Let’s look at map() first. map() transforms an RDD with N elements into another RDD with N elements. The important thing to note is that each element is transformed into exactly one other element, so the resultant RDD will have the same number of elements as before.
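To see this one-to-one behavior in isolation, here is a minimal sketch using toUpperCase, a hypothetical variation that is not part of the example below (the res numbers will differ in your session):

scala> // hypothetical: each String maps to exactly one uppercased String
scala> rdd.map(_.toUpperCase).collect
res2: Array[String] = Array(HADOOP IN REAL WORLD, BIG DATA)

scala> rdd.map(_.toUpperCase).count
res3: Long = 2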
The split(" ") function, applied to each element of this RDD, breaks the lines into words wherever it sees a space between the words.
scala> rdd.map(_.split(" ")).collect
res0: Array[Array[String]] = Array(Array(Hadoop, In, Real, World), Array(Big, Data))
The result RDD also has 2 elements, but now the type of each element is an Array and not a String as in the initial RDD. Each Array holds the list of words from the corresponding initial String.
When we do a count, we will get 2, which is the same as the count of the initial RDD before the split() transformation.
scala> rdd.map(_.split(" ")).count
res5: Long = 2
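The one-to-one behavior holds even when the output type changes completely. As another hypothetical variation (not from the original example), we can map each line to the number of words it contains; the count is still 2 and only the element type changes from String to Int:

scala> // hypothetical: one Int out for every String in
scala> rdd.map(_.split(" ").length).collect
res4: Array[Int] = Array(4, 2)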
flatMap()
flatMap() transforms an RDD with N elements into an RDD with potentially more than N elements. Let’s try the same split() operation with flatMap().
scala> rdd.flatMap(_.split(" ")).collect
res1: Array[String] = Array(Hadoop, In, Real, World, Big, Data)
Here with flatMap() we can see that the result RDD is not Array[Array[String]] as it was with map(); it is Array[String]. flatMap() took care of “flattening” the Array[String] produced for each line into plain Strings.
When we do a count() on the result RDD, we will get 6, which is more than the number of elements in the initial RDD.
scala> rdd.flatMap(_.split(" ")).count
res6: Long = 6
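A helpful way to think about flatMap() is as map() followed by a flattening step. RDDs themselves do not have a flatten() method, but plain Scala collections do, so here is a minimal sketch of the same idea on a local Seq (an analogy only, not the Spark API; res numbers will vary):

scala> // analogy on plain Scala collections, not on an RDD
scala> val lines = Seq("Hadoop In Real World", "Big Data")
lines: Seq[String] = List(Hadoop In Real World, Big Data)

scala> lines.map(_.split(" ")).flatten
res7: Seq[String] = List(Hadoop, In, Real, World, Big, Data)

scala> lines.flatMap(_.split(" "))
res8: Seq[String] = List(Hadoop, In, Real, World, Big, Data)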