Working with dates and time in Apache Spark

Today, I will demonstrate how to work with dates in Apache Spark. In particular, I want to cover things like getting the current date and the current time (with and without the timezone), and calculating a relative date (for example, yesterday, a month ago, the first day of the previous month, etc.).

I am going to write the code using PySpark, but the API should work the same in the Scala version of Apache Spark.

The current date and the current time

There are two ways to get the current date in PySpark. We can either get only the date or the date with the time.

Let’s start by creating a DataFrame that contains only one column and one row. In the second step, we will add two columns with the date and timestamp. We must remember that both of those functions return the time in the system timezone.

df_schema = StructType([StructField('a_column', StringType())])
    test_list = [
        ['1']
    ]

    df: DataFrame = spark_session.createDataFrame(test_list, schema=df_schema)

    df \
        .withColumn('current_date', current_date()) \
        .withColumn('current_timestamp', current_timestamp())
+--------+------------+--------------------+
|a_column|current_date|   current_timestamp|
+--------+------------+--------------------+
|       1|  2020-10-04|2020-10-04 11:55:...|
+--------+------------+--------------------+

Changing the timezone

To modify the timezone and get the time in UTC, we must set the spark session timezone using the session.timeZone configuration parameter:

spark_session.conf.set('spark.sql.session.timeZone', 'UTC')
+--------+------------+--------------------+
|a_column|current_date|   current_timestamp|
+--------+------------+--------------------+
|       1|  2020-10-04|2020-10-04 10:16:...|
+--------+------------+--------------------+

Relative dates

To calculate a relative date (for example, yesterday or tomorrow), we should use the date_add and date_sub functions:

df \
        .withColumn('current_date', current_date()) \
        .withColumn('yesterday', date_sub(col('current_date'), 1)) \
        .withColumn('tomorrow', date_add(col('current_date'), 1))
+--------+------------+----------+----------+
|a_column|current_date| yesterday|  tomorrow|
+--------+------------+----------+----------+
|       1|  2020-10-04|2020-10-03|2020-10-05|
+--------+------------+----------+----------+

Relative dates with months

It is easy to add or subtract dates, but what if we wanted to calculate a date in three months? The months vary in length, so we cannot add 90 days and pretend to have the correct value. Instead of that, Spark provides the add_months function, which also works with negative numbers, so we can use it to calculate the date three months ago:

df \
        .withColumn('current_date', current_date()) \
        .withColumn('in_three_months', add_months(col('current_date'), 3)) \
        .withColumn('three_months_ago', add_months(col('current_date'), -3))
+--------+------------+---------------+----------------+
|a_column|current_date|in_three_months|three_months_ago|
+--------+------------+---------------+----------------+
|       1|  2020-10-04|     2021-01-04|      2020-07-04|
+--------+------------+---------------+----------------+

The last day of the month

In addition to that, Spark supports calculating the date of the last day of the given month:

df \
        .withColumn('current_date', current_date()) \
        .withColumn('end_of_the_month', last_day(col('current_date')))
+--------+------------+----------------+
|a_column|current_date|end_of_the_month|
+--------+------------+----------------+
|       1|  2020-10-04|      2020-10-31|
+--------+------------+----------------+

The first day of the previous month

The previous month’s first day is an example of date arithmetics when the operations get complicated. There is no easy way to do that in Spark, either. Instead of that, what we have to calculate the end of the current month, subtract two months, and add one day:

df \
        .withColumn('current_date', current_date()) \
        .withColumn('first_day_of_previous_month', date_add(add_months(last_day(col('current_date')), -2), 1))
+--------+------------+---------------------------+
|a_column|current_date|first_day_of_previous_month|
+--------+------------+---------------------------+
|       1|  2020-10-04|                 2020-09-01|
+--------+------------+---------------------------+
Older post

How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive

How to use the saveAsTable function to create a partitioned table

Newer post

How to get an array/bag of elements from the Hive group by operator?

How to get an array of elements from one column when grouping by another column in Hive