How to flatten a struct in a Spark DataFrame?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (10/100)

This article will show you how to extract the struct field and convert them into separate columns in a Spark DataFrame.

Let’s assume that I have the following DataFrame, and the to_be_flattened column contains a struct with two fields:

1
2
3
4
5
6
7
8
9
10
11
12
13
+-------------------+
|    to_be_flattened|
+-------------------+
|  [1183, Amsterdam]|
|    [06123, Ankara]|
| [08067, Barcelona]|
|       [3030, Bern]|
|     [75116, Paris]|
| [1149-014, Lisbon]|
|   [00-999, Warsaw]|
|      [00199, Rome]|
|[HR-10 040, Zagreb]|
+-------------------+

Extracting those fields into columns is trivial, and we need only this line of code to achieve it:

1
df.select(col('to_be_flattened.*'))

As a result, we get this DataFrame:

1
2
3
4
5
6
7
8
9
10
11
12
13
+-----------+---------+
|postal_code|     city|
+-----------+---------+
|       1183|Amsterdam|
|      06123|   Ankara|
|      08067|Barcelona|
|       3030|     Bern|
|      75116|    Paris|
|   1149-014|   Lisbon|
|     00-999|   Warsaw|
|      00199|     Rome|
|  HR-10 040|   Zagreb|
+-----------+---------+


We have lost the original column name. What if I wanted to prefix the extracted columns with its previous name, and instead of postal_code and city have columns to_be_flattened_postal_code and to_be_flattened_city?

We can do it by getting the field names from the struct schema, iterating over them, and adding the prefix to every field:

1
2
df.select(col('to_be_flattened.*')) \
    .select([col(c).alias('to_be_flattened_' + c) for c in struct_schema.fieldNames()])
1
2
3
4
5
6
7
8
9
10
11
12
13
+---------------------------+--------------------+
|to_be_flattened_postal_code|to_be_flattened_city|
+---------------------------+--------------------+
|                       1183|           Amsterdam|
|                      06123|              Ankara|
|                      08067|           Barcelona|
|                       3030|                Bern|
|                      75116|               Paris|
|                   1149-014|              Lisbon|
|                     00-999|              Warsaw|
|                      00199|                Rome|
|                  HR-10 040|              Zagreb|
+---------------------------+--------------------+

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group