Cast column containing multiple string date formats to DateTime in Spark
Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:
from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])
This will choose the first format that can successfully parse the input string.
Usage:
df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()
+---+----------+----------+
| id| dt| pdt|
+---+----------+----------+
| 1|01/22/2010|2010-01-22|
| 2|2018-12-01|2018-12-01|
+---+----------+----------+
It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter.
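For Spark versions before 2.2, where to_date does not accept a format argument, the comment in the first snippet points at unix_timestamp and a cast instead; a minimal sketch of that variant (same idea, the to_date_pre22 name is just for illustration):

from pyspark.sql.functions import coalesce, unix_timestamp

def to_date_pre22(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # unix_timestamp yields null when the string doesn't match the format,
    # so coalesce again keeps the first format that parses successfully
    return coalesce(
        *[unix_timestamp(col, f).cast("timestamp").cast("date") for f in formats]
    )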
However, it won't help you with format ambiguities. In the general case it might not be possible to resolve them without manual intervention and cross-referencing against external data.
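To see the ambiguity problem with a hypothetical input: a string such as 01/02/2010 is a valid date under both MM/dd/yyyy and dd/MM/yyyy, so coalesce silently returns whichever format happens to come first in the list (reusing to_date_ and spark from above):

df_amb = spark.createDataFrame([(1, "01/02/2010")], ("id", "dt"))

# Interpreted as January 2nd, because MM/dd/yyyy is tried first
df_amb.withColumn("pdt", to_date_("dt", ("MM/dd/yyyy", "dd/MM/yyyy"))).show()

# Interpreted as February 1st, because dd/MM/yyyy is tried first
df_amb.withColumn("pdt", to_date_("dt", ("dd/MM/yyyy", "MM/dd/yyyy"))).show()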
The same thing can, of course, be done in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}

def to_date_(col: Column,
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}