What is the difference between Spark DataSet and RDD?
At this moment (Spark 1.6.0) the DataSet API is just a preview and only a small subset of features is implemented, so it is not possible to say anything about best practices yet.
Conceptually, a Spark DataSet is just a DataFrame with additional type safety (or, if you prefer a glance at the future, a DataFrame is a DataSet[Row]). It means you get all the benefits of Catalyst and Tungsten, including logical and physical plan optimization, vectorized operations, and low-level memory management.
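To make that concrete, here is a minimal sketch against the 1.6 API (as in spark-shell, where sqlContext and its implicits are already in scope; the Person class and the file path are just placeholders):

```scala
case class Person(name: String, age: Long)

val df = sqlContext.read.json("/tmp/people.json")  // DataFrame: untyped Rows
df.select($"age" + 1)      // a misspelled column name fails only at runtime

val ds = df.as[Person]     // DataSet[Person]: the same data, statically typed
ds.map(_.age + 1)          // field access is checked at compile time
```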
What you lose is flexibility and transparency.
First of all, your data has to be encoded before it can be used with a DataSet. Spark provides encoders for primitive types and Products / case classes, but as of now the API required to define custom serialization is not available. Most likely it will be relatively similar to the UDT API (see for example How to define schema for custom type in Spark SQL? and Serialize/Deserialize existing class for spark sql dataframe), with all its issues. It is relatively verbose, requires additional effort, and can become far from obvious with complex objects. Moreover, it touches some lower-level aspects of the API which are not well documented.
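To illustrate (same spark-shell assumptions as above; Record and Foo are made-up names):

```scala
// Encoders are provided out of the box for primitives...
val ints = Seq(1, 2, 3).toDS()                                // DataSet[Int]

// ...and for Products / case classes:
case class Record(key: String, value: Double)
val records = Seq(Record("a", 1.0), Record("b", 2.0)).toDS()  // DataSet[Record]

// ...but not for arbitrary classes:
class Foo(val x: Int)
// Seq(new Foo(1)).toDS()  // does not compile: no encoder can be found for Foo
```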
Regarding transparency, it is pretty much the same problem as with a planner in a typical RDBMS: it is great until it isn't. It is an amazing tool; it can analyze your data and make smart transformations, but like any tool it can take a wrong path and leave you staring at the execution plan, trying to figure out how to make things work.
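When it does take a wrong path, the main window into its decisions is the plan itself. A sketch, reusing Record from the previous snippet:

```scala
val ds = Seq(Record("a", 1.0), Record("b", 2.0)).toDS()

// Show the logical and physical plans Catalyst has chosen:
ds.filter(_.value > 1.0).toDF().explain(true)
```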
Based on the preview, I would say the DataSet API can be placed somewhere between the DataFrame API and the RDD API. It is more flexible than DataFrames but still provides similar optimizations and is well suited for general data processing tasks. It doesn't provide the same flexibility as the RDD API (at least not without a deeper dive into Catalyst internals).
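A rough sketch of that middle ground (Word is a made-up class):

```scala
case class Word(text: String)

val words = Seq(Word("spark"), Word("dataset"), Word("spark")).toDS()

// RDD-style functional transformations which still go through
// the Catalyst / Tungsten execution path:
val counts = words
  .filter(_.text.nonEmpty)
  .groupBy(_.text)   // GroupedDataset[String, Word] in 1.6
  .count()           // DataSet[(String, Long)]
```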
Another difference, which at this moment is just hypothetical, is the way it interacts with guest languages (R, Python). Similar to DataFrame, DataSet belongs to the JVM. It means that any possible interaction falls into one of two categories: native JVM operations (like DataFrame expressions) and guest-side code (like Python UDFs). Unfortunately the second category requires an expensive round-trip between the JVM and the guest environment.
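Seen from the JVM side, the two categories look roughly like this (a sketch; how exactly a guest-language DataSet binding would behave is the hypothetical part):

```scala
val df = Seq(("a", 1), ("b", 2)).toDF("k", "v")

// Category 1: a native expression, compiled by Catalyst, never leaves the JVM.
df.filter($"v" > 1)

// Category 2: arbitrary function code. In Scala it merely bypasses Catalyst;
// a Python UDF would additionally ship every row from the JVM to the Python
// process and back.
df.as[(String, Int)].filter(_._2 > 1)
```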
See also:
- Difference between DataSet API and DataFrame