What is the difference between Spark DataSet and RDD?
At this moment (Spark 1.6.0) the DataSet API is just a preview and only a small subset of features is implemented, so it is not possible to say anything about best practices yet.
Conceptually, a Spark DataSet is just a DataFrame with additional type safety (or, if you prefer a glance at the future, a DataFrame is a DataSet[Row]). It means you get all the benefits of Catalyst and Tungsten, including logical and physical plan optimization, vectorized operations, and low-level memory management.
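To make that concrete, here is a minimal sketch against the 1.6 API (as in spark-shell, where sqlContext and its implicits are already in scope; the Person class and the file path are just placeholders):

```scala
case class Person(name: String, age: Long)

val df = sqlContext.read.json("/tmp/people.json")  // DataFrame: untyped Rows
df.select($"age" + 1)      // a misspelled column name fails only at runtime

val ds = df.as[Person]     // DataSet[Person]: the same data, statically typed
ds.map(_.age + 1)          // field access is checked at compile time
```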
What you lose is flexibility and transparency.
First of all, your data has to be encoded before it can be used with a DataSet. Spark provides encoders for primitive types and Products / case classes, but as of now the API required to define custom serialization is not available. Most likely it will be relatively similar to the UDT API (see for example How to define schema for custom type in Spark SQL? and Serialize/Deserialize existing class for spark sql dataframe), with all its issues. It is relatively verbose, requires additional effort, and can become far from obvious with complex objects. Moreover, it touches some lower-level aspects of the API which are not well documented.
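To illustrate (same spark-shell assumptions as above; Record and Foo are made-up names):

```scala
// Encoders are provided out of the box for primitives...
val ints = Seq(1, 2, 3).toDS()                                // DataSet[Int]

// ...and for Products / case classes:
case class Record(key: String, value: Double)
val records = Seq(Record("a", 1.0), Record("b", 2.0)).toDS()  // DataSet[Record]

// ...but not for arbitrary classes:
class Foo(val x: Int)
// Seq(new Foo(1)).toDS()  // does not compile: no encoder can be found for Foo
```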
Regarding transparency, it is pretty much the same problem as with a planner in a typical RDBMS: it is great until it isn't. It is an amazing tool; it can analyze your data and make smart transformations, but like any tool it can take a wrong path and leave you staring at the execution plan, trying to figure out how to make things work.
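When it does take a wrong path, the main window into its decisions is the plan itself. A sketch, reusing Record from the previous snippet:

```scala
val ds = Seq(Record("a", 1.0), Record("b", 2.0)).toDS()

// Show the logical and physical plans Catalyst has chosen:
ds.filter(_.value > 1.0).toDF().explain(true)
```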
Based on the preview, I would say the DataSet API can be placed somewhere between the DataFrame API and the RDD API. It is more flexible than DataFrames but still provides similar optimizations and is well suited for general data processing tasks. It doesn't provide the same flexibility as the RDD API (at least not without a deeper dive into Catalyst internals).
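A rough sketch of that middle ground (Word is a made-up class):

```scala
case class Word(text: String)

val words = Seq(Word("spark"), Word("dataset"), Word("spark")).toDS()

// RDD-style functional transformations which still go through
// the Catalyst / Tungsten execution path:
val counts = words
  .filter(_.text.nonEmpty)
  .groupBy(_.text)   // GroupedDataset[String, Word] in 1.6
  .count()           // DataSet[(String, Long)]
```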
Another difference, which at this moment is just hypothetical, is the way it interacts with guest languages (R, Python). Similar to DataFrame, DataSet belongs to the JVM. It means that any possible interaction falls into one of two categories: native JVM operations (like DataFrame expressions) and guest-side code (like Python UDFs). Unfortunately the second category requires an expensive round-trip between the JVM and the guest environment.
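Seen from the JVM side, the two categories look roughly like this (a sketch; how exactly a guest-language DataSet binding would behave is the hypothetical part):

```scala
val df = Seq(("a", 1), ("b", 2)).toDF("k", "v")

// Category 1: a native expression, compiled by Catalyst, never leaves the JVM.
df.filter($"v" > 1)

// Category 2: arbitrary function code. In Scala it merely bypasses Catalyst;
// a Python UDF would additionally ship every row from the JVM to the Python
// process and back.
df.as[(String, Int)].filter(_._2 > 1)
```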
See also:
- Difference between DataSet API and DataFrame