RDD, DataFrame and Dataset are all Spark APIs that were introduced at different points in time. The goal of these APIs is to help us work with large datasets in a distributed fashion in Spark, with performance in mind.
RDD
- Introduced in 2011; RDDs have been part of Spark since the beginning
- RDD is now considered to be a low level API
- RDD is still the core of Spark
- Whether you use DataFrames or Datasets, all your operations eventually get translated into RDD operations
- The RDD API provides many transformation functions, such as map(), filter() and reduce(), for performing computations on data (see the sketch after this list)
- RDDs distribute a collection of JVM objects
- RDDs are type safe: syntax and type errors are caught at compile time
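As a quick sketch of the transformation and action style described above (assuming sc is an existing SparkContext, for example the one created by spark-shell):

val nums = sc.parallelize(1 to 10)       // build a small RDD of Ints
val doubled = nums.map(_ * 2)            // transformation: map
val evens = doubled.filter(_ % 4 == 0)   // transformation: filter
val total = evens.reduce(_ + _)          // action: reduce triggers the actual computation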
RDD Example
val rdd = sc.textFile("student.txt")                            // RDD[String], one line per element
// assuming each line holds a comma separated "name,age" pair
val rdd_filtered = rdd.filter(_.split(",")(1).trim.toInt > 13)  // transformation
rdd_filtered.saveAsTextFile("students_over13.txt")              // action
In the example above, we first read a file, which gives us an RDD. Next we apply filter() on that RDD, which again results in an RDD that we call rdd_filtered. filter() is a transformation function, and every transformation function returns a new RDD. Transformations are lazy, so at this point nothing has been computed yet. Finally, we call saveAsTextFile() on rdd_filtered, which is an action that runs the job and saves the contents of the RDD.
DataFrame
- Introduced in 2013
- Considered a high-level API
- All operations on DataFrames go through Spark’s Catalyst optimizer, which turns our DataFrame code into an optimized plan that is ultimately executed as RDD operations
- Lets you perform “SQL like” operations on data
- DataFrames distribute a collection of Row objects
- Catches syntax errors at compile time, but analysis errors only at run time (see the sketch below)
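Here is a minimal sketch of that last point, assuming an existing SparkSession named spark; the column name agee is a deliberate typo:

import spark.implicits._
val people = Seq(("amy", 12), ("bob", 15)).toDF("name", "age")
people.filter("age > 13")    // fine: the column exists
people.filter("agee > 13")   // compiles, but fails with an AnalysisException at run time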
DataFrame Example
// inferSchema makes Spark detect column types; toDF names the raw CSV columns
val df = spark.read.option("inferSchema", "true").csv("student.txt").toDF("name", "age")
val df_filtered = df.filter("age > 13")
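The same filter can also be written with the Column API or as literal SQL, and explain() prints the plan produced by the Catalyst optimizer. A short sketch reusing df and df_filtered from the example above:

import org.apache.spark.sql.functions.col
val df_filtered2 = df.filter(col("age") > 13)                          // same filter via the Column API
df.createOrReplaceTempView("students")                                 // register the DataFrame as a temp view
val df_filtered3 = spark.sql("SELECT * FROM students WHERE age > 13")  // same filter in SQL
df_filtered.explain()                                                  // print the Catalyst-optimized physical plan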
Dataset
- Introduced in 2015
- Considered a high-level API
- Just like DataFrames, all operations on Datasets go through Spark’s Catalyst optimizer, which turns our Dataset code into an optimized plan that is ultimately executed as RDD operations
- Just like DataFrames, lets you perform “SQL like” operations on data
- A Dataset lets you work with data represented as a user-defined object
- Datasets are considered type safe
- Datasets distribute a collection of user-defined JVM objects, which are internally represented by Spark Row objects
- A DataFrame is nothing but a Dataset of type Row [DataFrame = Dataset[Row]] (see the sketch after this list)
- Catches syntax errors at compile time and, thanks to type safety, catches analysis errors at compile time as well
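The “DataFrame = Dataset[Row]” point above is literal in the Scala API, where DataFrame is simply a type alias for Dataset[Row]. A minimal sketch, assuming an existing SparkSession named spark:

import org.apache.spark.sql.{DataFrame, Dataset, Row}
val df: DataFrame = spark.range(5).toDF("id")   // an ordinary DataFrame
val rows: Dataset[Row] = df                     // compiles as-is: DataFrame is an alias for Dataset[Row]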
Dataset Example
case class Student(name: String, age: Int)
import spark.implicits._                   // provides the encoder needed by .as[Student]
val ds = spark.read.option("inferSchema", "true").csv("student.txt").toDF("name", "age").as[Student]
val ds_filtered = ds.filter(_.age > 13)    // typed filter, checked by the compiler
In the example above, we created a Dataset of type Student, so every element in the Dataset is a Student object, and that is what gives us type safety.
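To see that type safety in action: a made-up field name such as agee is rejected by the Scala compiler instead of surfacing as a run-time analysis error, because the compiler knows every element is a Student.

ds_filtered.show()          // action: runs the typed filter and prints the matching students
// ds.filter(_.agee > 13)   // does NOT compile: value agee is not a member of Student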