December 11, 2020

You have an RDD in your code and now want to work with the data using DataFrames in Spark. Spark provides functions to convert an RDD to a DataFrame, and the conversion is quite simple.
Solution
You can use the createDataFrame function, which takes an RDD and returns a DataFrame.
Assume this is the data in your RDD:
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+
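The article does not show how the RDD was built. As a minimal sketch, assuming the rows are plain Scala tuples of (String, Double, Double), the sample data could be created like this in spark-shell or a Scala script. The app name and the local master setting are illustrative assumptions, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting; in spark-shell a `spark` session already exists
val spark = SparkSession.builder()
  .appName("rdd-to-dataframe") // hypothetical app name
  .master("local[*]")          // assumption: run locally on all cores
  .getOrCreate()

// RDD of (String, Double, Double) tuples matching the sample data above
val rdd = spark.sparkContext.parallelize(Seq(
  ("blue", 20.0, 60.0),
  ("green", 30.5, 20.0),
  ("red", 70.0, 50.9)
))
```

Because the elements are tuples, Spark can derive column types from them when the RDD is converted to a DataFrame.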
Without column names
If you pass only the RDD, Spark assigns default column names (_1, _2, _3), so the output doesn’t have meaningful column names.
val df = spark.createDataFrame(rdd)
df.show()

+-----+----+----+
|   _1|  _2|  _3|
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+
With column names
With the code below you specify the column names, but Spark still infers the schema, that is, the data types of your columns.
val df1 = spark.createDataFrame(rdd).toDF("id", "val1", "val2")
df1.show()

+-----+----+----+
|   id|val1|val2|
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+
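To see what "Spark infers the schema" means in practice, printSchema can be called on the result. The sketch below assumes a tuple RDD like the sample data and a locally created session (the app name is hypothetical). It also shows the rdd.toDF shortcut from spark.implicits._, which is offered here as an alternative and is not the approach used in the article.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local session; in spark-shell `spark` already exists
val spark = SparkSession.builder()
  .appName("rdd-to-dataframe-names") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // brings rdd.toDF(...) into scope

val rdd = spark.sparkContext.parallelize(Seq(
  ("blue", 20.0, 60.0),
  ("green", 30.5, 20.0),
  ("red", 70.0, 50.9)
))

// Column names are given explicitly; the types come from the tuple element types
val df1 = spark.createDataFrame(rdd).toDF("id", "val1", "val2")
df1.printSchema() // prints the inferred type of each column

// Shortcut via implicits, producing the same column names
val df1b = rdd.toDF("id", "val1", "val2")
```

In both cases the id column comes out as a string and the two value columns as doubles, because those are the element types of the tuples.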
With proper schema
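Here you supply both the column names and the data types through a StructType, so Spark does not infer anything. Note that createDataFrame(rdd, schema) expects an RDD of Row objects, so an RDD of tuples has to be converted first.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Explicit schema: column name, data type, nullable flag
val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", DoubleType, true))

// Assumes rdd holds (String, Double, Double) tuples, as in the earlier examples
val rowRdd = rdd.map { case (id, val1, val2) => Row(id, val1, val2) }
val df2 = spark.createDataFrame(rowRdd, schema)
df2.show()

+-----+----+----+
|   id|val1|val2|
+-----+----+----+
| blue|20.0|60.0|
|green|30.5|20.0|
|  red|70.0|50.9|
+-----+----+----+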
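One way to confirm that the explicit schema was applied is to compare the DataFrame's schema with the one you defined. This is a small sketch under the assumption that the RDD holds tuples and is mapped to Row objects before the schema is attached. The session setup and app name are illustrative, not from the article.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Assumption: local session; in spark-shell `spark` already exists
val spark = SparkSession.builder()
  .appName("rdd-to-dataframe-schema") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Tuples are converted to Row objects so an explicit schema can be attached
val rowRdd = spark.sparkContext.parallelize(Seq(
  ("blue", 20.0, 60.0),
  ("green", 30.5, 20.0),
  ("red", 70.0, 50.9)
)).map { case (id, val1, val2) => Row(id, val1, val2) }

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", DoubleType, true))

val df2 = spark.createDataFrame(rowRdd, schema)

// The DataFrame carries exactly the schema that was passed in
df2.printSchema()
require(df2.schema == schema, "schema was not applied as defined")
```

Unlike the inferred variants, the nullable flag of every column here is whatever you set in the StructField, which is useful when downstream code depends on it.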