What is the right Date/Datetime format in JSON for Spark SQL to automatically infer the schema for it?
With a small modification to Spark it is possible to infer dates in a format of your choosing (I used the Date.toJSON format) and still get reasonable performance.
Get the latest maintenance branch:
git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4
Replace the following block in InferSchema:
case VALUE_STRING if parser.getTextLength < 1 =>
  // Zero length strings and nulls have special handling to deal
  // with JSON generators that do not distinguish between the two.
  // To accurately infer types for empty strings that are really
  // meant to represent nulls we assume that the two are isomorphic
  // but will defer treating null fields as strings until all the
  // record fields' types have been combined.
  NullType

case VALUE_STRING => StringType
with the following code:
case VALUE_STRING =>
  val len = parser.getTextLength
  if (len < 1) {
    NullType
  } else if (len == 24) {
    // try to match dates of the form "1968-01-01T12:34:56.789Z"
    // for performance, only try parsing if text is 24 chars long and ends with a Z
    val chars = parser.getTextCharacters
    val offset = parser.getTextOffset
    if (chars(offset + len - 1) == 'Z') {
      try {
        org.apache.spark.sql.catalyst.util.DateUtils.stringToTime(new String(chars, offset, len))
        TimestampType
      } catch {
        case e: Exception => StringType
      }
    } else {
      StringType
    }
  } else {
    StringType
  }
Build Spark according to your setup. I used:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install
To test, create a file named datedPeople.json at the top level of the Spark directory, containing the following data:
{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"}
{"name":"Bob"}
{"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"}
{"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}
Read in the file. Make sure you set the spark.sql.json.useJacksonStreamingAPI conf option before using the sqlContext at all, or it won't work. Dates!
.\bin\spark-shell.cmd
scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
scala> val datedPeople = sqlContext.read.json("datedPeople.json")
datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]
scala> datedPeople.foreach(println)
[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
The JSON type inference will never infer date types. Non-zero-length strings are always inferred to be strings. Source code:
private[sql] object InferSchema {
  // ...
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      // ...
      case VALUE_STRING => StringType
      // ...
    }
  }
  // ...
}
For automatic detection this would have to be changed to look at the actual string (parser.getValueAsString) and return DateType when appropriate, based on its format.
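A minimal sketch of what such a check could look like is below. The helper name, the ISO 8601 regex, and the fallback behaviour are my own assumptions, not actual Spark code, and I return TimestampType (as in the patch above) since the sample values carry a time of day; inferField could delegate to something like this for VALUE_STRING tokens:

import org.apache.spark.sql.types.{DataType, NullType, StringType, TimestampType}

// Hypothetical helper: given the raw string value from the parser
// (parser.getValueAsString), decide which type to infer for it.
def inferStringType(text: String): DataType = {
  val iso8601 = """\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z""".r
  text match {
    case null | ""  => NullType   // treat empty strings like nulls
    case iso8601()  => TimestampType
    case _          => StringType
  }
}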
It's probably simpler to just take the normal auto-generated schema and convert the date types as a second step.
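For example, with the sample file above, a post-processing step along these lines would convert the inferred string column into a timestamp. The birthdate column name comes from the sample data, and the SimpleDateFormat-based UDF is just one way to do the conversion:

import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

// Parse ISO 8601 strings such as "2012-04-23T18:25:43.511Z" into timestamps.
val parseIso = udf { (s: String) =>
  if (s == null) null
  else {
    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
    new Timestamp(fmt.parse(s).getTime)
  }
}

// Read with the default inference (birthdate comes back as a string),
// then convert it as a second step.
val raw = sqlContext.read.json("datedPeople.json")
val withDates = raw.select(parseIso(raw("birthdate")).as("birthdate"), raw("name"))
withDates.printSchema()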
Another option would be to read a small sample of the data (without using Spark) and infer the schema yourself. Then use your schema to create the DataFrame. This avoids some computation as well.
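For instance, a hand-written schema for the sample file above might look like the sketch below. The field names come from the sample data, and I'm assuming the JSON reader will parse the ISO 8601 strings into the timestamp column once the type is declared up front:

import org.apache.spark.sql.types._

// Declare the schema explicitly instead of letting Spark infer it.
val schema = StructType(Seq(
  StructField("birthdate", TimestampType, nullable = true),
  StructField("name", StringType, nullable = true)))

val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")
datedPeople.printSchema()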