How to handle null values when writing to parquet from Spark
I wrote a PySpark solution for this (df is a dataframe with columns of NullType):
from pyspark.sql.types import NullType

# get the dataframe schema
my_schema = list(df.schema)
null_cols = []

# iterate over the schema to collect the NullType columns
# (isinstance is more robust than comparing str(st.dataType) to 'NullType',
# since the string form differs across Spark versions)
for st in my_schema:
    if isinstance(st.dataType, NullType):
        null_cols.append(st)

# cast NullType columns to string (or whatever type you'd like)
for ncol in null_cols:
    mycolname = str(ncol.name)
    df = df.withColumn(mycolname, df[mycolname].cast('string'))
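If you prefer a single pass, the same cast can be expressed as one select over the schema; a minimal sketch, assuming the same df as above:

from pyspark.sql.functions import col
from pyspark.sql.types import NullType

# cast every NullType column to string, pass all other columns through unchanged
df = df.select([
    col(f.name).cast('string') if isinstance(f.dataType, NullType) else col(f.name)
    for f in df.schema.fields
])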
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
|-- comments: null (nullable = true)
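That untyped column is exactly what breaks the write, because Parquet has no null type. A minimal PySpark reproduction (the output path is illustrative, and the exact error message varies by Spark version):

df = spark.sql("SELECT NULL AS comments")
# fails with AnalysisException:
#   Parquet data source does not support null data type.
df.write.parquet("/tmp/null_col")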
As per the comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)
and the result can be safely written to Parquet.
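For completeness, the PySpark equivalent, as a sketch (the output path is illustrative):

df = spark.sql("SELECT CAST(NULL AS DOUBLE) AS comments")
df.printSchema()
# root
#  |-- comments: double (nullable = true)

# writes without error, since the column now has a concrete type
df.write.parquet("/tmp/comments")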