Spark 2.0.x dump a csv file from a dataframe containing one array of type string

The reason why you are getting this error is that csv file format doesn't support array types, you'll need to express it as a string to be able to save.

Try the following :

import org.apache.spark.sql.functions._

val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)

import org.apache.spark.sql.Column

def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)

Pyspark implementation.

In this example, change the field column_as_array to column_as_string before saving.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))

Then you can drop the old column (array type) before saving.

df.drop("column_as_array").write.csv(...)

Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:

def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}

Also, it doesn't use a UDF.

No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:

import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
                .write
                .csv(path="/home/me/saveDF")

Hope that helps.

Spark 2.0.x dump a csv file from a dataframe containing one array of type string

Tags:

Csv

Arrays

Apache Spark

Related

Recent Posts