What is the right Date/Datetime format in JSON for Spark SQL to automatically infer the schema for it?
With a small modification to Spark it is possible to infer dates in a format of your choosing (I used the Date.toJSON format) and still get reasonable performance.
Get the latest maintenance branch:
git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4
Replace the following block in InferSchema:
case VALUE_STRING if parser.getTextLength < 1 =>
  // Zero length strings and nulls have special handling to deal
  // with JSON generators that do not distinguish between the two.
  // To accurately infer types for empty strings that are really
  // meant to represent nulls we assume that the two are isomorphic
  // but will defer treating null fields as strings until all the
  // record fields' types have been combined.
  NullType

case VALUE_STRING => StringType
with the following code:
case VALUE_STRING =>
  val len = parser.getTextLength
  if (len < 1) {
    NullType
  } else if (len == 24) {
    // try to match dates of the form "1968-01-01T12:34:56.789Z"
    // for performance, only try parsing if text is 24 chars long and ends with a Z
    val chars = parser.getTextCharacters
    val offset = parser.getTextOffset
    if (chars(offset + len - 1) == 'Z') {
      try {
        org.apache.spark.sql.catalyst.util.DateUtils.stringToTime(new String(chars, offset, len))
        TimestampType
      } catch {
        case e: Exception => StringType
      }
    } else {
      StringType
    }
  } else {
    StringType
  }
Build Spark according to your setup. I used:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install
To test, create a file named datedPeople.json at the top level of the Spark directory, containing the following data:
{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"}
{"name":"Bob"}
{"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"}
{"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}
Read in the file. Make sure you set the spark.sql.json.useJacksonStreamingAPI conf option before using the sqlContext at all, or it won't work. Dates!
.\bin\spark-shell.cmd
scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
scala> val datedPeople = sqlContext.read.json("datedPeople.json")
datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]
scala> datedPeople.foreach(println)
[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
The JSON type inference will never infer date types. Non-zero-length strings are always inferred to be strings. Source code:
private[sql] object InferSchema {
  // ...
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      // ...
      case VALUE_STRING => StringType
      // ...
    }
  }
  // ...
}
For automatic detection this would have to be changed to look at the actual string (parser.getValueAsString) and return DateType when appropriate, based on its format.
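A minimal sketch of what such a check could look like is below. The helper name, the ISO 8601 regex, and the fallback behaviour are my own assumptions, not actual Spark code, and I return TimestampType (as in the patch above) since the sample values carry a time of day; inferField could delegate to something like this for VALUE_STRING tokens:

import org.apache.spark.sql.types.{DataType, NullType, StringType, TimestampType}

// Hypothetical helper: given the raw string value from the parser
// (parser.getValueAsString), decide which type to infer for it.
def inferStringType(text: String): DataType = {
  val iso8601 = """\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z""".r
  text match {
    case null | ""  => NullType   // treat empty strings like nulls
    case iso8601()  => TimestampType
    case _          => StringType
  }
}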
It's probably simpler to just take the normal auto-generated schema and convert the date types as a second step.
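For example, with the sample file above, a post-processing step along these lines would convert the inferred string column into a timestamp. The birthdate column name comes from the sample data, and the SimpleDateFormat-based UDF is just one way to do the conversion:

import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

// Parse ISO 8601 strings such as "2012-04-23T18:25:43.511Z" into timestamps.
val parseIso = udf { (s: String) =>
  if (s == null) null
  else {
    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
    new Timestamp(fmt.parse(s).getTime)
  }
}

// Read with the default inference (birthdate comes back as a string),
// then convert it as a second step.
val raw = sqlContext.read.json("datedPeople.json")
val withDates = raw.select(parseIso(raw("birthdate")).as("birthdate"), raw("name"))
withDates.printSchema()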
Another option would be to read a small sample of the data (without using Spark) and infer the schema yourself. Then use your schema to create the DataFrame. This avoids some computation as well.
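For instance, a hand-written schema for the sample file above might look like the sketch below. The field names come from the sample data, and I'm assuming the JSON reader will parse the ISO 8601 strings into the timestamp column once the type is declared up front:

import org.apache.spark.sql.types._

// Declare the schema explicitly instead of letting Spark infer it.
val schema = StructType(Seq(
  StructField("birthdate", TimestampType, nullable = true),
  StructField("name", StringType, nullable = true)))

val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")
datedPeople.printSchema()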