How to convert RDD to DataFrame in spark?

Published by Big Data In Real World on December 9, 2020

Solution

You can use the createDataFrame function, which takes an RDD and returns a DataFrame.

Assume this is the data in your RDD:

+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+
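The post does not show how the RDD itself is built, so here is a minimal sketch of one way to create it. It assumes each record is a (String, Double, Double) tuple and that you are running locally; the app name and master setting are illustrative, not from the original.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("rdd-to-dataframe") // hypothetical app name
  .master("local[*]")          // assumption: local run
  .getOrCreate()

// Assumption: each record is a (String, Double, Double) tuple.
val rdd = spark.sparkContext.parallelize(Seq(
  ("blue", 20.0, 60.0),
  ("green", 30.5, 20.0),
  ("red", 70.0, 50.9)
))

// Print the raw records to confirm the contents.
rdd.collect().foreach(println)
```

An RDD of tuples (or case classes) is what the single-argument createDataFrame call accepts, because Spark can infer the column types from the element type.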

Without column names

When you pass only the RDD, Spark assigns default column names (_1, _2, _3), so the output doesn't have meaningful column names.

val df = spark.createDataFrame(rdd)

df.show()

+-----+----+----+
|   _1|  _2|  _3|
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+

With column names

Here you specify the column names, but Spark still infers the schema, that is, the data types of your columns.

val df1 = spark.createDataFrame(rdd).toDF("id", "val1", "val2")

df1.show()

+-----+----+----+
|   id|val1|val2|
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+
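To see what Spark actually inferred, you can print the schema. The sketch below is self-contained and assumes the RDD holds (String, Double, Double) tuples; it also uses the toDF shortcut from spark.implicits, which is an alternative to calling createDataFrame first and is not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("rdd-to-dataframe") // hypothetical app name
  .master("local[*]")          // assumption: local run
  .getOrCreate()

// Brings the rdd.toDF(...) extension method into scope.
import spark.implicits._

// Assumption: each record is a (String, Double, Double) tuple.
val rdd = spark.sparkContext.parallelize(Seq(
  ("blue", 20.0, 60.0),
  ("green", 30.5, 20.0),
  ("red", 70.0, 50.9)
))

// Name the columns; the data types are still inferred from the tuple type.
val df1 = rdd.toDF("id", "val1", "val2")

// Show the inferred column names and types.
df1.printSchema()
```

Here the first column is inferred as a string and the other two as doubles, because that is what the tuple elements are.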

With proper schema

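To control the data types and nullability yourself, define a StructType and pass it to createDataFrame along with the RDD.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Explicit schema: column name, data type, nullable
val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", DoubleType, true))

// createDataFrame(rdd, schema) expects an RDD[Row]; if rdd holds tuples, convert it first
val rowRdd = rdd.map { case (id, val1, val2) => Row(id, val1, val2) }

val df2 = spark.createDataFrame(rowRdd, schema)

df2.show()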

+-----+----+----+
|   id|val1|val2|
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+
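The show() output looks the same as before, so the effect of an explicit schema is easier to check with printSchema. The following is a minimal sketch, assuming the records are built directly as Row objects; marking id as non-nullable is an illustrative choice, not something the original post does.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("rdd-to-dataframe") // hypothetical app name
  .master("local[*]")          // assumption: local run
  .getOrCreate()

// Assumption: the records are created as Row objects from the start.
val rowRdd = spark.sparkContext.parallelize(Seq(
  Row("blue", 20.0, 60.0),
  Row("green", 30.5, 20.0),
  Row("red", 70.0, 50.9)
))

// Illustrative schema: id is declared non-nullable, the values nullable.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("val1", DoubleType, nullable = true),
  StructField("val2", DoubleType, nullable = true)
))

val df2 = spark.createDataFrame(rowRdd, schema)

// The DataFrame carries exactly the schema that was passed in.
df2.printSchema()
```

Spark does not validate the rows against the schema at creation time, so a mismatch between the Row values and the declared types typically surfaces only when an action such as show() runs.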

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.
