Spark: merge DataFrames with mismatched schemas without extra disk I/O

Parquet schema merging is disabled by default. Turn it on in one of two ways:

(1) Set the global option: spark.sql.parquet.mergeSchema=true

(2) Set it per read in code: sqlContext.read.option("mergeSchema", "true").parquet("my.parquet")
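As a rough illustration of option (2), the sketch below writes two small Parquet partitions with different columns and reads them back with mergeSchema enabled. It assumes Spark 2.x or later (SparkSession rather than sqlContext) and is meant to run as a standalone script; the temp directory, the partition name `part`, and the column names are made up for the example. Note that this route does go through disk, since the schemas are merged while reading the files.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session, for illustration only
val spark = SparkSession.builder()
  .appName("merge-schema-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch location for the two partitions
val basePath = Files.createTempDirectory("merge_schema_sketch").toString

// Two partitions of the same table with different (but compatible) schemas
Seq((1, "a")).toDF("id", "name").write.parquet(s"$basePath/part=1")
Seq((2, 3.5)).toDF("id", "score").write.parquet(s"$basePath/part=2")

// With mergeSchema on, the reader reconciles the per-file schemas
val mergedFromDisk = spark.read.option("mergeSchema", "true").parquet(basePath)
mergedFromDisk.printSchema()
```

Rows from a partition that lacks a column come back with null in that column.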

To avoid going through disk, you can instead add each column that frame B lacks as a null column and then union the two frames:

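import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Fields present in A but missing from B (matched by column name)
val missingFields = A.schema.filterNot(f => B.columns.contains(f.name))
// Add every missing field to B as a null column of the right type
val C: DataFrame = missingFields.foldLeft(B) { (df, field) =>
  df.withColumn(field.name, lit(null).cast(field.dataType))
}
// unionAll matches columns by position, so put C in A's column order first
// (assumes every column of B also exists in A; use union on Spark 2.0+)
A.unionAll(C.select(A.columns.map(col): _*))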
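If you are on Spark 3.1 or later, the manual null-padding may not be needed: Dataset.unionByName accepts an allowMissingColumns flag that fills columns absent on either side with nulls. A minimal sketch, assuming a local SparkSession and made-up frames `a` and `b`, written as a standalone script:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, for illustration only
val spark = SparkSession.builder()
  .appName("union-by-name-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical frames: b lacks the "score" column that a has
val a = Seq((1, "x", 1.0)).toDF("id", "name", "score")
val b = Seq((2, "y")).toDF("id", "name")

// Columns are matched by name; "score" is filled with null for rows from b
val merged = a.unionByName(b, allowMissingColumns = true)
merged.show()
```

Unlike the positional unionAll, this also covers the case where each frame has columns the other lacks, and nothing is written to disk.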
