Spark DataFrame: collect() vs select()
Calling select is lazily evaluated: it only describes a computation and does not run it. For example:
val df1 = df.select("col1")
val df2 = df1.filter("col1 == 3")
Both statements above only build a lazy execution plan; it is executed when you call an action on that DataFrame, such as show, collect, etc.
val rows = df2.collect() // an Array[Row] on the driver, not another DataFrame
Call .explain at the end of your chain of transformations to inspect its plan.
More detailed information is available here: Transformations and Actions
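To make the lazy/eager distinction concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's API: the `LazyFrame` class and the `is_three` helper are hypothetical names invented for illustration. Transformations only record steps, and nothing touches the data until the `collect` action runs.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame (hypothetical, not Spark's API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows         # source data: a list of dicts
        self._plan = tuple(plan)  # recorded (name, function) steps, not yet run

    def select(self, *cols):
        # Transformation: record a projection step and return a new frame.
        step = ("select " + ", ".join(cols),
                lambda rows: ({c: r[c] for c in cols} for r in rows))
        return LazyFrame(self._rows, self._plan + (step,))

    def filter(self, predicate):
        # Transformation: record a filter step and return a new frame.
        step = ("filter", lambda rows: (r for r in rows if predicate(r)))
        return LazyFrame(self._rows, self._plan + (step,))

    def explain(self):
        # Describe the recorded plan without executing it.
        return [name for name, _ in self._plan]

    def collect(self):
        # Action: run every recorded step and materialize a local list.
        rows = iter(self._rows)
        for _, fn in self._plan:
            rows = fn(rows)
        return list(rows)


calls = []

def is_three(row):
    calls.append(row)  # track when the predicate is actually evaluated
    return row["col1"] == 3

df = LazyFrame([{"col1": 1, "col2": "a"}, {"col1": 3, "col2": "b"}])
df1 = df.select("col1")
df2 = df1.filter(is_three)

calls_before_action = len(calls)  # the predicate has not run yet
plan = df2.explain()
result = df2.collect()            # only now is the data processed
calls_after_action = len(calls)
```

In this sketch the predicate is never invoked while `df1` and `df2` are being built; it runs only inside `collect`, which mirrors how Spark defers work until an action is called.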
Actions vs Transformations
- Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
spark-sql doc
- select(*cols) (Transformation) - Projects a set of expressions and returns a new DataFrame.
Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.
>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
Executing the select(column-name1, column-name2, etc) method on a DataFrame returns a new DataFrame that holds only the columns named in the select() call.
For example, assume df has several columns, including "name" and "value" among others.
df2 = df.select("name","value")
df2 will hold only two columns ("name" and "value") out of all the columns of df.
Because select is a transformation, df2 stays distributed across the executors and is computed only when an action runs; it is not brought to the driver, as happens when using collect().
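The difference in where the data ends up can be sketched with a toy model in plain Python. This is not Spark's API: `PartitionedFrame` is a hypothetical class in which each inner list stands in for the rows held by one executor. Unlike Spark, this toy `select` runs eagerly; the point is only that its result keeps the partitioned layout, while `collect` flattens everything into one local list.

```python
class PartitionedFrame:
    """Toy stand-in for a distributed DataFrame (hypothetical, not Spark's API)."""

    def __init__(self, partitions):
        # One list of row dicts per pretend executor.
        self.partitions = partitions

    def select(self, *cols):
        # Project each partition independently; the layout is unchanged.
        return PartitionedFrame(
            [[{c: row[c] for c in cols} for row in part]
             for part in self.partitions]
        )

    def collect(self):
        # Flatten all partitions into a single local list, as the driver would hold it.
        return [row for part in self.partitions for row in part]


df = PartitionedFrame([
    [{"name": "a", "value": 1, "other": True}],
    [{"name": "b", "value": 2, "other": False}],
])
df2 = df.select("name", "value")  # still two partitions, fewer columns
rows = df2.collect()              # one plain list in local memory
```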
sql-programming-guide
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+
You can run collect() on a DataFrame (spark docs):
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
spark docs
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
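The memory argument above can be illustrated with a small plain-Python sketch. It does not use Spark: `source`, `collect`, and `take` are hypothetical helpers that mimic the behaviour described, with a counter showing how many elements each approach pulls into local memory.

```python
from itertools import islice

pulled = 0  # how many elements have been fetched from the source

def source(n):
    # Pretend remote dataset: counts every element that is actually fetched.
    global pulled
    for i in range(n):
        pulled += 1
        yield i

def collect(it):
    # Fetch everything into one local list.
    return list(it)

def take(it, n):
    # Fetch only the first n elements and stop.
    return list(islice(it, n))

first = take(source(1_000_000), 100)
pulled_by_take = pulled

pulled = 0
everything = collect(source(1_000))
pulled_by_collect = pulled
```

Here `take` stops after 100 elements even though the source could produce a million, whereas `collect` always fetches the whole dataset, which is why it is only safe on results known to be small.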