If you are working with Spark, you will come across its three main APIs: DataFrames, Datasets, and RDDs.
An RDD, or Resilient Distributed Dataset, is a fault-tolerant, immutable collection of records that is partitioned across the nodes of a cluster and processed in parallel through low-level APIs. Because RDD operations are evaluated lazily, Spark can plan the work before executing it, which improves performance. RDDs support two types of operations: transformations, which build a new RDD from an existing one, and actions, which trigger the computation and return a result, as the sketch below shows.
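To make this concrete, here is a minimal Scala sketch of the two kinds of operations; it assumes a local Spark setup, and the application name and sample values are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RddOperations")   // hypothetical application name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Create an RDD from a local collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformation: lazily describes a new RDD, nothing is computed yet
val squares = numbers.map(n => n * n)

// Action: triggers the actual computation and returns a result to the driver
val total = squares.reduce(_ + _)
println(total) // 55
```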
If you choose to work with RDDs, you have to optimize each and every operation yourself. In addition, unlike Datasets and DataFrames, RDDs do not infer the schema of the ingested data, so you have to specify it explicitly, as the sketch below illustrates.
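Here is one hedged sketch of declaring a schema by hand when turning an RDD of rows into a DataFrame; the column names and values are made up for illustration, and `spark` is assumed to be an existing SparkSession:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumes `spark` is an existing SparkSession
val rowRdd = spark.sparkContext.parallelize(Seq(
  Row("Alice", 34),
  Row("Bob", 45)
))

// The RDD carries no schema of its own, so it has to be declared explicitly
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()
```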
A DataFrame is a distributed collection of rows organized under named columns. In simple terms, it looks like a spreadsheet with column headers, or you can think of it as the equivalent of a table in a relational database or a DataFrame in R or Python. It shares three main characteristics with RDDs: it is immutable, distributed, and lazily evaluated.
In Spark, DataFrames can be created in several ways: from structured data files (such as JSON, CSV, or Parquet), from tables in Hive, from external databases, or from existing RDDs.
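The sketch below illustrates a few of these paths; the file name people.json, the temporary view, and the Person class are hypothetical, and `spark` is assumed to be an existing SparkSession val:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)   // hypothetical record type

// Assumes `spark` is an existing SparkSession
import spark.implicits._

// 1. From a structured data file
val peopleDf = spark.read.json("people.json")   // hypothetical path

// 2. From an existing RDD
val fromRdd = spark.sparkContext
  .parallelize(Seq(Person("Alice", 34), Person("Bob", 45)))
  .toDF()

// 3. From a table registered in the catalog (for example a Hive table)
peopleDf.createOrReplaceTempView("people")
val fromTable = spark.sql("SELECT name, age FROM people")
```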
The main drawback of the DataFrame API is that it does not provide compile-time type safety: columns are referred to by name and resolved only at runtime, which limits you when the structure of the data is not known in advance.
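A small sketch of what that means in practice, reusing the hypothetical people.json file and Person class from above: a misspelled column name still compiles against the DataFrame API and only fails when Spark analyzes the query, whereas the typed Dataset API rejects it at compile time.

```scala
import spark.implicits._

val df = spark.read.json("people.json")

// Compiles fine, but would throw an AnalysisException at runtime
// because the column "nme" does not exist:
// df.select("nme")

// The typed view catches the same mistake before the job ever runs:
val ds = df.as[Person]
// ds.map(p => p.nme)   // does not compile: value nme is not a member of Person
```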
A Dataset is a strongly typed, immutable collection of objects that are mapped to a relational schema. A Dataset can be created from JVM objects and manipulated using functional transformations such as map and filter. Datasets can be created in two ways: programmatically from existing JVM objects, or by reading an external source such as a JSON file and mapping it to a class, as shown below.
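A minimal sketch of both paths, reusing the hypothetical Person case class and people.json file from above; `spark` is again assumed to be an existing SparkSession val:

```scala
import spark.implicits._

// 1. Programmatically, from in-memory JVM objects
val peopleDs = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

// 2. By reading an external source and mapping it onto the case class
val fromJsonDs = spark.read.json("people.json").as[Person]

// Typed, functional transformations operate on Person objects directly
peopleDs.filter(p => p.age > 40).show()
```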
The main disadvantage of Datasets is that querying them requires explicit typecasting, for example casting selected columns into strings; otherwise the result falls back to untyped rows.
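One hedged illustration of this, continuing with the hypothetical peopleDs from the previous sketch: selecting a single column drops back to an untyped DataFrame unless the column's type is restated explicitly.

```scala
import org.apache.spark.sql.functions.col

// Without the cast the result is an untyped DataFrame of Rows;
// restating the type keeps it a Dataset[String]
val names = peopleDs.select(col("name").as[String])
```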