pyspark dataframe add a column if it doesn't exist
You can check whether the column is present in df.columns and modify the DataFrame only if necessary:
from pyspark.sql import functions as f

if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))
For nested schemas you may need to inspect df.schema instead, like below:
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)
>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False
In case someone needs this in Scala:
import org.apache.spark.sql.functions.lit

val newDf =
  if (df.columns.contains("f")) df
  else df.withColumn("f", lit(""))