RDD, DataFrame and Dataset are all Spark APIs that were introduced at different points in time. The goal of these APIs is to help us work with large datasets in a distributed fashion in Spark, with performance in mind.
RDD
- Introduced in 2011; RDDs have been part of Spark since the beginning
- RDD is now considered to be a low level API
- RDD is still the core of Spark
- Whether you use DataFrames or Datasets, all your operations eventually get translated into RDD operations
- The RDD API provides many transformation functions, such as map(), filter() and reduce(), for performing computations on data (see the sketch after this list)
- RDDs distribute a collection of JVM objects
- RDDs are type safe: syntax and type errors are caught at compile time
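As a quick sketch of the transformation and action style described above (assuming sc is an existing SparkContext, for example the one created by spark-shell):

val nums = sc.parallelize(1 to 10)       // build a small RDD of Ints
val doubled = nums.map(_ * 2)            // transformation: map
val evens = doubled.filter(_ % 4 == 0)   // transformation: filter
val total = evens.reduce(_ + _)          // action: reduce triggers the actual computation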
RDD Example
val rdd = sc.textFile("student.txt")                            // RDD[String], one line per element
// assuming each line holds a comma separated "name,age" pair
val rdd_filtered = rdd.filter(_.split(",")(1).trim.toInt > 13)  // transformation
rdd_filtered.saveAsTextFile("students_over13.txt")              // action
In the example above, we first read a file, which gives us an RDD. Next we apply filter() on that RDD, which again results in an RDD that we call rdd_filtered. filter() is a transformation function, and every transformation function returns a new RDD. Transformations are lazy, so at this point nothing has been computed yet. Finally, we call saveAsTextFile() on rdd_filtered, which is an action that runs the job and saves the contents of the RDD.
DataFrame
- Introduced in 2013
- Considered a high-level API
- All operations on DataFrames go through Spark’s Catalyst optimizer, which turns our DataFrame code into an optimized plan that is ultimately executed as RDD operations
- Lets you perform “SQL like” operations on data
- DataFrames distribute a collection of Row objects
- Catches syntax errors at compile time, but analysis errors only at run time (see the sketch below)
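Here is a minimal sketch of that last point, assuming an existing SparkSession named spark; the column name agee is a deliberate typo:

import spark.implicits._
val people = Seq(("amy", 12), ("bob", 15)).toDF("name", "age")
people.filter("age > 13")    // fine: the column exists
people.filter("agee > 13")   // compiles, but fails with an AnalysisException at run time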
DataFrame Example
// inferSchema makes Spark detect column types; toDF names the raw CSV columns
val df = spark.read.option("inferSchema", "true").csv("student.txt").toDF("name", "age")
val df_filtered = df.filter("age > 13")
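The same filter can also be written with the Column API or as literal SQL, and explain() prints the plan produced by the Catalyst optimizer. A short sketch reusing df and df_filtered from the example above:

import org.apache.spark.sql.functions.col
val df_filtered2 = df.filter(col("age") > 13)                          // same filter via the Column API
df.createOrReplaceTempView("students")                                 // register the DataFrame as a temp view
val df_filtered3 = spark.sql("SELECT * FROM students WHERE age > 13")  // same filter in SQL
df_filtered.explain()                                                  // print the Catalyst-optimized physical plan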
Dataset
- Introduced in 2015
- Considered a high-level API
- Just like DataFrames, all operations on Datasets go through Spark’s Catalyst optimizer, which turns our Dataset code into an optimized plan that is ultimately executed as RDD operations
- Just like DataFrames, lets you perform “SQL like” operations on data
- A Dataset lets you work with data represented as a user-defined object
- Datasets are considered type safe
- Datasets distribute a collection of user-defined JVM objects, which are internally represented by Spark Row objects
- A DataFrame is nothing but a Dataset of type Row [DataFrame = Dataset[Row]] (see the sketch after this list)
- Catches syntax errors at compile time and, thanks to type safety, catches analysis errors at compile time as well
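The “DataFrame = Dataset[Row]” point above is literal in the Scala API, where DataFrame is simply a type alias for Dataset[Row]. A minimal sketch, assuming an existing SparkSession named spark:

import org.apache.spark.sql.{DataFrame, Dataset, Row}
val df: DataFrame = spark.range(5).toDF("id")   // an ordinary DataFrame
val rows: Dataset[Row] = df                     // compiles as-is: DataFrame is an alias for Dataset[Row]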
Dataset Example
case class Student(name: String, age: Int)
import spark.implicits._                   // provides the encoder needed by .as[Student]
val ds = spark.read.option("inferSchema", "true").csv("student.txt").toDF("name", "age").as[Student]
val ds_filtered = ds.filter(_.age > 13)    // typed filter, checked by the compiler
In the example above, we created a Dataset of type Student, so every element in the Dataset is a Student object, and that is what gives us type safety.
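To see that type safety in action: a made-up field name such as agee is rejected by the Scala compiler instead of surfacing as a run-time analysis error, because the compiler knows every element is a Student.

ds_filtered.show()          // action: runs the typed filter and prints the matching students
// ds.filter(_.agee > 13)   // does NOT compile: value agee is not a member of Student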