Spark DataFrame TimestampType - how to get Year, Month, Day values from field?
Actually, we really do not need to import any python library. We can separate the year, month, date using simple SQL. See the below example,
+----------+
| _c0|
+----------+
|1872-11-30|
|1873-03-08|
|1874-03-07|
|1875-03-06|
|1876-03-04|
|1876-03-25|
|1877-03-03|
|1877-03-05|
|1878-03-02|
|1878-03-23|
|1879-01-18|
I have a date column in my data frame which contains the date, month and year and assume I want to extract only the year from the column.
df.createOrReplaceTempView("res")
sqlDF = spark.sql("SELECT EXTRACT(year from `_c0`) FROM res ")
Here I'm creating a temporary view and store the year values using this single line and the output will be,
+-----------------------+
|year(CAST(_c0 AS DATE))|
+-----------------------+
| 1872|
| 1873|
| 1874|
| 1875|
| 1876|
| 1876|
| 1877|
| 1877|
| 1878|
| 1878|
| 1879|
| 1879|
| 1879|
Since Spark 1.5 you can use a number of date processing functions:
pyspark.sql.functions.year
pyspark.sql.functions.month
pyspark.sql.functions.dayofmonth
pyspark.sql.functions.dayofweek
pyspark.sql.functions.dayofyear
pyspark.sql.functions.weekofyear
import datetime
from pyspark.sql.functions import year, month, dayofmonth
elevDF = sc.parallelize([
(datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])
elevDF.select(
year("date").alias('year'),
month("date").alias('month'),
dayofmonth("date").alias('day')
).show()
# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# +----+-----+---+
You can use simple map
as with any other RDD:
elevDF = sqlContext.createDataFrame(sc.parallelize([
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))
(elevDF
.map(lambda (date, hour, value): (date.year, date.month, date.day))
.collect())
and the result is:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
By the way, datetime.datetime
stores the hour anyway so keeping it separately seems to be a waste of memory.
You can use functions in pyspark.sql.functions
: functions like year
, month
, etc
Refer to PySpark's official DataFrame documentation for details about available functions.
from pyspark.sql.functions import *
newdf = elevDF.select(year(elevDF.date).alias('dt_year'), month(elevDF.date).alias('dt_month'), dayofmonth(elevDF.date).alias('dt_day'), dayofyear(elevDF.date).alias('dt_dayofy'), hour(elevDF.date).alias('dt_hour'), minute(elevDF.date).alias('dt_min'), weekofyear(elevDF.date).alias('dt_week_no'), unix_timestamp(elevDF.date).alias('dt_int'))
newdf.show()
+-------+--------+------+---------+-------+------+----------+----------+
|dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no| dt_int|
+-------+--------+------+---------+-------+------+----------+----------+
| 2015| 9| 6| 249| 0| 0| 36|1441497601|
| 2015| 9| 6| 249| 0| 0| 36|1441497601|
| 2015| 9| 6| 249| 0| 0| 36|1441497603|
| 2015| 9| 6| 249| 0| 1| 36|1441497694|
| 2015| 9| 6| 249| 0| 20| 36|1441498808|
| 2015| 9| 6| 249| 0| 20| 36|1441498811|
| 2015| 9| 6| 249| 0| 20| 36|1441498815|