Spark Structured Streaming automatically converts timestamps to local time
Note:
This answer is useful primarily for Spark < 2.2. For newer Spark versions, see the answer by astro-asz.
However, we should note that as of Spark 2.4.0, spark.sql.session.timeZone doesn't set user.timezone (java.util.TimeZone.getDefault). So setting spark.sql.session.timeZone alone can result in a rather awkward situation where SQL and non-SQL components use different timezone settings. Therefore I still recommend setting user.timezone explicitly, even if spark.sql.session.timeZone is set.
TL;DR Unfortunately this is how Spark handles timestamps right now, and there is really no built-in alternative other than operating on epoch time directly, without using date/time utilities.
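To illustrate the epoch-time workaround outside of Spark: an epoch value identifies the same instant regardless of any session or JVM time zone setting, which is why operating on it directly avoids the conversion problem. A minimal plain-Java sketch (the epoch value is just an illustrative constant):

```java
import java.time.Instant;

public class EpochDemo {
    public static void main(String[] args) {
        // An epoch value denotes the same instant everywhere; no
        // session or default time zone is involved in interpreting it.
        long epochSeconds = 1483265410L; // illustrative constant
        Instant instant = Instant.ofEpochSecond(epochSeconds);
        // Instant.toString() always renders in UTC, regardless of
        // the JVM default time zone.
        System.out.println(instant); // 2017-01-01T10:10:10Z
    }
}
```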
You can find an insightful discussion on the Spark developers list: SQL TIMESTAMP semantics vs. SPARK-18350
The cleanest workaround I've found so far is to set -Duser.timezone to UTC for both the driver and executors. For example:
bin/spark-shell --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
--conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"
or by adjusting configuration files (spark-defaults.conf):
spark.driver.extraJavaOptions -Duser.timezone=UTC
spark.executor.extraJavaOptions -Duser.timezone=UTC
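What -Duser.timezone=UTC accomplishes is making UTC the JVM's default time zone, i.e. what java.util.TimeZone.getDefault returns. A small plain-Java sketch of the equivalent effect (here simulated with setDefault, since the real option must be passed at JVM startup):

```java
import java.util.TimeZone;

public class DefaultZoneDemo {
    public static void main(String[] args) {
        // Passing -Duser.timezone=UTC at JVM startup makes UTC the
        // default; we simulate the same effect programmatically.
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        TimeZone tz = TimeZone.getDefault();
        System.out.println(tz.getID());        // UTC
        System.out.println(tz.getRawOffset()); // 0
    }
}
```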
For me it worked to use:
spark.conf.set("spark.sql.session.timeZone", "UTC")
It tells Spark SQL to use UTC as the default time zone for timestamps. I used it in Spark SQL, for example:
select *, cast('2017-01-01 10:10:10' as timestamp) from someTable
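The session time zone matters here because the cast interprets the wall-clock string in that zone, so the resulting instant shifts with the setting. A plain-Java sketch (no Spark) of how the same string maps to different epoch values under UTC versus a hypothetical +01:00 session zone:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class SessionZoneDemo {
    public static void main(String[] args) {
        // The same wall-clock string maps to different instants
        // depending on the zone used to interpret it.
        LocalDateTime wallClock = LocalDateTime.parse("2017-01-01T10:10:10");
        long utcEpoch = wallClock.toEpochSecond(ZoneOffset.UTC);
        long plusOneEpoch = wallClock.toEpochSecond(ZoneOffset.ofHours(1));
        // Interpreting the string at +01:00 yields an instant one
        // hour earlier on the epoch timeline.
        System.out.println(utcEpoch - plusOneEpoch); // 3600
    }
}
```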
I know it does not work in Spark 2.0.1, but it works in Spark 2.2. I also used it in SQLTransformer and it worked.
I am not sure about streaming though.