How to use double pipe as delimiter in CSV?
I ran into this and found a good solution. I am using Spark 2.3, and I suspect it should work on all of Spark 2.2+, but I have not tested it. The way it works is that I replace the || with a tab, and then the built-in CSV reader can take a Dataset[String]. I used a tab because I have commas in my data.
// Read the file as a Dataset[String], rewrite the || delimiter to a tab,
// then let the built-in CSV reader parse the result.
val df = spark.sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv(spark.sqlContext.read.textFile("filename")
    .map(line => line.split("\\|\\|").mkString("\t")))
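To make the rewrite step concrete, here is what happens to a single line (an illustration only, using the foo||12 value from the example further down):

// Illustration: one input line rewritten before the CSV reader sees it.
"foo||12".split("\\|\\|").mkString("\t")   // => "foo\t12"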
Hope this helps someone else.
EDIT:
As of Spark 3.0.1, this works out of the box.
Example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
| name| id|
+----------+----+
| foo| 12|
| brian| 34|
|cray||name| 123|
| cray|name|
+----------+----+
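Reading straight from a file works the same way on 3.0.1+. A quick sketch (the path here is a placeholder, not from the original example):

// Same options as above, but reading from a file (placeholder path).
val csvFromFile = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "||")
  .csv("path/to/data.csv")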
So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation, and I checked the Spark 2.0 CSV reader; it has the same requirement.
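For context, this is roughly the kind of call that triggers it on Spark 2.x (a minimal sketch, assuming the same example.txt used below):

// Sketch only: on Spark 2.x a multi-character delimiter is rejected outright.
val broken = sqlContext.read
  .option("delimiter", "¦¦")
  .csv("example.txt")
// java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦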
Given all of this, if your data is simple enough that you won't have entries containing ¦¦, I would load it like so:
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
| ID|FILLER|CODE|
+-----+------+----+
|12345| | 10|
+-----+------+----+
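Since the schema above declares every column as StringType, one follow-up option (my addition, not part of the original answer) is to cast the numeric-looking columns after loading:

// Assumption: ID and CODE are actually integers; cast them once the DataFrame exists.
import org.apache.spark.sql.functions.col

val typed = df
  .withColumn("ID", col("ID").cast("int"))
  .withColumn("CODE", col("CODE").cast("int"))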