From the developer's point of view, an RDD is simply a set of Java or Scala objects representing data. The interface also has a Java equivalent, JavaRDD.
RDDs offer many methods, such as map(), filter(), and reduce(), for performing computations on the data. Transformations such as map() and filter() each return a new RDD representing the transformed data, but they are lazy: no computation is performed until an action method such as reduce(), collect(), or saveAsObjectFile() is called.
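As a minimal sketch of this laziness (the data, app name, and local master are illustrative, not from the original text):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddLaziness {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-laziness").setMaster("local[*]"))

    // map() and filter() are lazy transformations: each returns a new RDD that
    // merely describes the computation.
    val words   = sc.parallelize(Seq("spark", "rdd", "dataset"))
    val lengths = words.map(_.length).filter(_ > 3)

    // Nothing has executed yet; collect() is the action that triggers the job.
    println(lengths.collect().mkString(", ")) // prints: 5, 7

    sc.stop()
  }
}
```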
DataFrame API
The DataFrame API introduces the concept of a schema to describe the data. When performing computations in a single process, Spark can serialize the data into off-heap storage in a binary format and perform many transformations directly on this off-heap memory, avoiding the garbage-collection costs of constructing individual objects for each row. Because Spark understands the schema, there is no need to use Java serialization to encode the data.
The disadvantage is that the DataFrame API is very different from the RDD API: the code builds a relational query plan that Spark's Catalyst optimizer can execute. This style is natural for developers who are used to building query plans, but not for the majority of developers.
Also, since the code refers to data attributes by name, the compiler cannot catch errors; they only surface at runtime.
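A short sketch of the DataFrame style (the sqlContext, input path, and column names are illustrative assumptions):

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc`; the input path is hypothetical.
val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")

// This builds a relational query plan that the Catalyst optimizer executes.
people.filter(people("age") > 21)
      .select("name")
      .show()

// Columns are referenced by name, so a typo such as people("agee") still compiles
// and only fails at runtime with an AnalysisException.
```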
The DataFrame API is also **very much Scala-centric**, with only limited support for Java.
Spark's Catalyst optimizer cannot infer a schema from arbitrary Java objects; it assumes that any objects in the DataFrame implement the scala.Product interface, which Scala case classes do.
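To illustrate the scala.Product point: a DataFrame can be created directly from an RDD of case class instances, since case classes implement Product (the Person class and values here are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

// Case classes implement scala.Product, so Catalyst can infer the schema from their fields.
case class Person(name: String, age: Int)

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val people = sc.parallelize(Seq(Person("Ann", 32), Person("Bob", 19))).toDF()
people.printSchema() // name: string, age: int
```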
Dataset API
The Dataset API provides the object-oriented programming style and compile-time type safety of the RDD API, together with the performance benefits of the Catalyst query optimizer.
Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.
For serialization, the Dataset API introduces the concept of encoders, which translate between JVM representations (objects) and Spark's internal binary format.
Spark has built-in encoders that are quite advanced: they generate bytecode to interact with the off-heap data and provide on-demand access to individual attributes without having to deserialize an entire object.
Although Dataset code looks like RDD code, behind the scenes it builds a query plan rather than dealing with individual objects, and if only one attribute is used in the code, only that attribute is read from off-heap storage; the rest of the object's data is never touched.
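A minimal sketch of the Spark 1.6 Dataset preview (the class and data are again hypothetical):

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // built-in encoders for case classes and primitives

val people = Seq(Person("Ann", 32), Person("Bob", 19)).toDS()

// Typed, lambda-based operations: a misspelled field is a compile-time error,
// yet the code still builds a query plan and runs on Spark's binary format.
val adultNames = people.filter(p => p.age >= 21).map(_.name)
adultNames.collect().foreach(println)
```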
The Dataset API is designed to work equally well with both Java and Scala.
Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.
Conclusion
If we are developing primarily in Java, it is worth considering a move to Scala before adopting the DataFrame or Dataset APIs.
If we are developing in Scala and need code to go into production with Spark 1.6.0, the DataFrame API is clearly the most stable option available and currently offers the best performance.
However, the Dataset API preview looks very promising and provides a more natural way to code. Given the rapid evolution of Spark, it is likely that this API will mature very quickly and become the preferred choice for developing new applications.