Working with dates and time in Apache Spark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (13/100)

Today, I will demonstrate how to work with dates in Apache Spark. In particular, I want to cover things like getting the current date and the current time (with and without the timezone), and calculating a relative date (for example, yesterday, a month ago, the first day of the previous month, etc.).

I am going to write the code using PySpark, but the API should work the same in the Scala version of Apache Spark.

The current date and the current time

There are two ways to get the current date in PySpark: we can retrieve either only the date or the date together with the time.

Let’s start by creating a DataFrame that contains only one column and one row. In the second step, we will add two columns using the current_date and current_timestamp functions. We must remember that both of them return values in the Spark session timezone, which defaults to the system timezone.

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import current_date, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType

# create (or reuse) a local Spark session
spark_session = SparkSession.builder.master('local[*]').getOrCreate()

# a single-column, single-row DataFrame to demonstrate the date functions
df_schema = StructType([StructField('a_column', StringType())])
test_list = [
    ['1']
]

df: DataFrame = spark_session.createDataFrame(test_list, schema=df_schema)

# both functions return values in the Spark session timezone
df \
    .withColumn('current_date', current_date()) \
    .withColumn('current_timestamp', current_timestamp()) \
    .show()
+--------+------------+--------------------+
|a_column|current_date|   current_timestamp|
+--------+------------+--------------------+
|       1|  2020-10-04|2020-10-04 11:55:...|
+--------+------------+--------------------+
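
Before we change anything, we can check which timezone the session is using by reading the corresponding configuration parameter. A quick check (by default, Spark falls back to the JVM system timezone):

# print the timezone currently used by the Spark session
print(spark_session.conf.get('spark.sql.session.timeZone'))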

Changing the timezone

To modify the timezone and get the time in UTC, we must set the Spark session timezone using the spark.sql.session.timeZone configuration parameter:

spark_session.conf.set('spark.sql.session.timeZone', 'UTC')
+--------+------------+--------------------+
|a_column|current_date|   current_timestamp|
+--------+------------+--------------------+
|       1|  2020-10-04|2020-10-04 10:16:...|
+--------+------------+--------------------+
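
Changing the configuration of a running session works, but if we know the desired timezone upfront, we can also set it while building the session. A minimal sketch (the application name and the local master are placeholders used only for illustration):

from pyspark.sql import SparkSession

# start (or reuse) a session that reports dates and timestamps in UTC
spark_session = SparkSession.builder \
    .master('local[*]') \
    .appName('spark-dates') \
    .config('spark.sql.session.timeZone', 'UTC') \
    .getOrCreate()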

Relative dates

To calculate a relative date (for example, yesterday or tomorrow), we should use the date_add and date_sub functions:

from pyspark.sql.functions import col, current_date, date_add, date_sub

df \
    .withColumn('current_date', current_date()) \
    .withColumn('yesterday', date_sub(col('current_date'), 1)) \
    .withColumn('tomorrow', date_add(col('current_date'), 1)) \
    .show()
+--------+------------+----------+----------+
|a_column|current_date| yesterday|  tomorrow|
+--------+------------+----------+----------+
|       1|  2020-10-04|2020-10-03|2020-10-05|
+--------+------------+----------+----------+
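
If we prefer SQL expressions, the same arithmetic is available as Spark SQL functions. Here is a sketch of an equivalent selectExpr call on the same DataFrame:

# the same relative dates expressed with Spark SQL functions
df.selectExpr(
    'a_column',
    'current_date() AS current_date',
    'date_sub(current_date(), 1) AS yesterday',
    'date_add(current_date(), 1) AS tomorrow'
).show()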

Relative dates with months

It is easy to add or subtract days, but what if we wanted to calculate the date three months from now? Months vary in length, so we cannot add 90 days and pretend to have the correct value. Instead, Spark provides the add_months function, which also works with negative numbers, so we can use it to calculate the date three months ago:

from pyspark.sql.functions import add_months, col, current_date

df \
    .withColumn('current_date', current_date()) \
    .withColumn('in_three_months', add_months(col('current_date'), 3)) \
    .withColumn('three_months_ago', add_months(col('current_date'), -3)) \
    .show()
+--------+------------+---------------+----------------+
|a_column|current_date|in_three_months|three_months_ago|
+--------+------------+---------------+----------------+
|       1|  2020-10-04|     2021-01-04|      2020-07-04|
+--------+------------+---------------+----------------+
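
add_months also deals with month-end dates for us, which is another reason why adding a fixed number of days is not a substitute. A small sketch with an arbitrary literal date:

from pyspark.sql.functions import add_months, col, lit, to_date

# adding one month to January 31st returns the last day of February
# instead of overflowing into March, as adding 31 days would
df \
    .withColumn('some_date', to_date(lit('2020-01-31'))) \
    .withColumn('a_month_later', add_months(col('some_date'), 1)) \
    .show()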


The last day of the month

In addition to that, Spark supports calculating the last day of a given month with the last_day function:

from pyspark.sql.functions import col, current_date, last_day

df \
    .withColumn('current_date', current_date()) \
    .withColumn('end_of_the_month', last_day(col('current_date'))) \
    .show()
+--------+------------+----------------+
|a_column|current_date|end_of_the_month|
+--------+------------+----------------+
|       1|  2020-10-04|      2020-10-31|
+--------+------------+----------------+

The first day of the previous month

The first day of the previous month is an example of date arithmetic where the operations get complicated. Spark has no dedicated function for it, so we have to combine the ones above: calculate the end of the current month, subtract two months, and add one day:

from pyspark.sql.functions import add_months, col, current_date, date_add, last_day

df \
    .withColumn('current_date', current_date()) \
    .withColumn('first_day_of_previous_month', date_add(add_months(last_day(col('current_date')), -2), 1)) \
    .show()
+--------+------------+---------------------------+
|a_column|current_date|first_day_of_previous_month|
+--------+------------+---------------------------+
|       1|  2020-10-04|                 2020-09-01|
+--------+------------+---------------------------+
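
For completeness, here is an alternative sketch of the same calculation that relies on the trunc function: trunc(..., 'month') truncates a date to the first day of its month, so we only need to subtract one month afterwards.

from pyspark.sql.functions import add_months, col, current_date, trunc

# truncate to the first day of the current month, then go back one month
df \
    .withColumn('current_date', current_date()) \
    .withColumn('first_day_of_previous_month', add_months(trunc(col('current_date'), 'month'), -1)) \
    .show()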



Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group