Spark unionAll multiple dataframes
The simplest solution is to reduce with union (unionAll in Spark < 2.0):
val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)
This is relatively concise and shouldn't move data from off-heap storage, but it extends the lineage with each union and requires non-linear time to perform plan analysis, which can be a problem if you try to merge a large number of DataFrames.
You can also convert to RDDs and use SparkContext.union:
dfs match {
case h :: Nil => Some(h)
case h :: _ => Some(h.sqlContext.createDataFrame(
h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
h.schema
))
case Nil => None
}
This keeps the lineage short and the analysis cost low, but otherwise it is less efficient than merging DataFrames directly.
For pyspark you can do the following:
from functools import reduce
from pyspark.sql import DataFrame
dfs = [df1,df2,df3]
df = reduce(DataFrame.unionAll, dfs)
It's also worth noting that the columns must be in the same order in every dataframe for this to work. The union matches columns by position, not by name, so a mismatched column order can silently give unexpected results.
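One way to guard against this is to select every dataframe's columns in the order of the first one before the union. This is only a minimal sketch: union_aligned is a hypothetical helper name (not part of the PySpark API), and it assumes all dataframes share the same column names and compatible types, differing only in column order. It relies only on the columns, select and union members of the dataframes, so it does not import pyspark itself.

```python
from functools import reduce


def union_aligned(dfs):
    """Union dataframes positionally after reordering columns to match the first.

    Hypothetical helper: assumes every dataframe has the same column names,
    possibly in a different order.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("need at least one dataframe")
    columns = dfs[0].columns  # reference column order
    # select(*columns) puts each dataframe's columns into the reference order
    aligned = [df.select(*columns) for df in dfs]
    # On Spark < 2.0 replace union with unionAll
    return reduce(lambda left, right: left.union(right), aligned)
```

With real dataframes this would be called as df = union_aligned([df1, df2, df3]). If one of the dataframes lacks a column, select raises an error instead of silently misaligning the data.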
If you are using pyspark 2.3 or greater, you can use unionByName so you don't have to reorder the columns.
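A minimal sketch of that variant, following the same reduce pattern as above. The helper name union_by_name is hypothetical; the only PySpark call it depends on is DataFrame.unionByName, and it assumes all dataframes have the same set of column names.

```python
from functools import reduce


def union_by_name(dfs):
    """Union dataframes by matching column names (needs pyspark 2.3 or greater)."""
    dfs = list(dfs)
    if not dfs:
        raise ValueError("need at least one dataframe")
    # unionByName resolves columns by name, so their order may differ between inputs
    return reduce(lambda left, right: left.unionByName(right), dfs)
```

Usage would be df = union_by_name([df1, df2, df3]). With pyspark imported, reduce(DataFrame.unionByName, dfs) should behave the same way for a non-empty list.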