How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
If my understanding is correct, the code implies the following order of type inference (types listed first are tried first):
NullType
IntegerType
LongType
DecimalType
DoubleType
TimestampType
BooleanType
StringType
With that, I think the issue is that 20171001 matches IntegerType before even considering TimestampType (which uses the timestampFormat option, not dateFormat).
One solution would be to define the schema and use it with the schema operator (of DataFrameReader), or let Spark SQL infer the schema and use the cast operator.
I'd choose the former if the number of fields is not high.
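Both options could look roughly like the following PySpark sketch. The column names (id, event_date), the header option, and the helper function names are hypothetical and only serve the illustration; the two-argument form of to_date assumes Spark 2.2 or later.

```python
# Spark datetime pattern matching values such as 20171001.
DATE_PATTERN = "yyyyMMdd"


def read_with_explicit_schema(spark, path):
    """Option 1: skip inference and declare the column as a date up front."""
    # Imported inside the function so the sketch loads without a Spark install.
    from pyspark.sql.types import DateType, StringType, StructField, StructType

    schema = StructType([
        StructField("id", StringType(), True),        # hypothetical column
        StructField("event_date", DateType(), True),  # hypothetical column
    ])
    return (
        spark.read
        .option("header", "true")
        .option("dateFormat", DATE_PATTERN)  # applied to DateType columns
        .schema(schema)
        .csv(path)
    )


def read_then_cast(spark, path):
    """Option 2: let Spark infer an integer column, then convert it to a date."""
    from pyspark.sql.functions import col, to_date

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
    )
    # The inferred integer is cast to a string first so it can be parsed
    # with the date pattern.
    return df.withColumn(
        "event_date",
        to_date(col("event_date").cast("string"), DATE_PATTERN),
    )
```

Either function takes an existing SparkSession and a CSV path, for example read_with_explicit_schema(spark, "data.csv"), where the path is again just a placeholder.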
In this case you simply cannot depend on the schema inference due to format ambiguity.
Since the input can be parsed both as IntegerType (or any higher-precision numeric type) and as TimestampType, and the former has higher precedence (internally Spark tries IntegerType -> LongType -> DecimalType -> DoubleType -> TimestampType), the inference mechanism will never reach the TimestampType case.
To be specific, with schema inference enabled, Spark will call tryParseInteger, which will correctly parse the input and stop. A subsequent call will match the second case and finish at the same tryParseInteger call.
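The precedence described above can be illustrated with a small pure-Python model. It is a simplified sketch and not Spark's actual code: the function name infer_type is made up, the decimal step is reduced to "integer too large for a long", and the timestamp format uses Python's strptime syntax rather than Spark's pattern syntax.

```python
from datetime import datetime

INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def _parses(parser, value):
    """Return True if parser(value) succeeds without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_type(field, timestamp_format="%Y-%m-%d"):
    """Try each type in precedence order and return the first that fits."""
    if field is None or field == "":
        return "NullType"
    if _parses(int, field):
        number = int(field)
        if INT_MIN <= number <= INT_MAX:
            return "IntegerType"
        if LONG_MIN <= number <= LONG_MAX:
            return "LongType"
        return "DecimalType"  # simplification: integers beyond the long range
    if _parses(float, field):
        return "DoubleType"
    if _parses(lambda s: datetime.strptime(s, timestamp_format), field):
        return "TimestampType"
    if field.lower() in ("true", "false"):
        return "BooleanType"
    return "StringType"


# The value is a valid integer, so the timestamp check is never reached,
# even though the supplied format would match it.
print(infer_type("20171001", timestamp_format="%Y%m%d"))
# A value that is not numeric does fall through to the timestamp check.
print(infer_type("2017-10-01"))
```

In this model, no choice of timestamp_format changes the result for 20171001, which mirrors why the dateFormat and timestampFormat options have no effect on the inferred type here.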