How to flatten a struct in a Spark DataFrame?

This article will show you how to extract struct fields and turn them into separate columns in a Spark DataFrame.

Let’s assume that I have the following DataFrame, in which the to_be_flattened column contains a struct with two fields:

+-------------------+
|    to_be_flattened|
+-------------------+
|  [1183, Amsterdam]|
|    [06123, Ankara]|
| [08067, Barcelona]|
|       [3030, Bern]|
|     [75116, Paris]|
| [1149-014, Lisbon]|
|   [00-999, Warsaw]|
|      [00199, Rome]|
|[HR-10 040, Zagreb]|
+-------------------+
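If you want to follow along, here is a minimal sketch of how such a DataFrame could be created. The struct field names postal_code and city match the output shown later; the sample rows are taken from the table above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A minimal, illustrative setup: a single struct column with two string fields.
df = spark.createDataFrame(
    [(('1183', 'Amsterdam'),), (('06123', 'Ankara'),), (('75116', 'Paris'),)],
    'to_be_flattened struct<postal_code: string, city: string>'
)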

Extracting those fields into columns is trivial; a single select (plus the col import) is enough:

from pyspark.sql.functions import col

df.select(col('to_be_flattened.*'))
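The select method also accepts plain strings, so df.select('to_be_flattened.*') produces the same result without importing col.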

As a result, we get this DataFrame:

+-----------+---------+
|postal_code|     city|
+-----------+---------+
|       1183|Amsterdam|
|      06123|   Ankara|
|      08067|Barcelona|
|       3030|     Bern|
|      75116|    Paris|
|   1149-014|   Lisbon|
|     00-999|   Warsaw|
|      00199|     Rome|
|  HR-10 040|   Zagreb|
+-----------+---------+

However, we have lost the original column name. What if I wanted to prefix the extracted columns with it and, instead of postal_code and city, have columns to_be_flattened_postal_code and to_be_flattened_city?

We can do it by reading the field names from the struct’s schema, iterating over them, and adding the prefix to every field:

struct_schema = df.schema['to_be_flattened'].dataType  # the StructType of the column

df.select(col('to_be_flattened.*')) \
    .select([col(c).alias('to_be_flattened_' + c) for c in struct_schema.fieldNames()])

+---------------------------+--------------------+
|to_be_flattened_postal_code|to_be_flattened_city|
+---------------------------+--------------------+
|                       1183|           Amsterdam|
|                      06123|              Ankara|
|                      08067|           Barcelona|
|                       3030|                Bern|
|                      75116|               Paris|
|                   1149-014|              Lisbon|
|                     00-999|              Warsaw|
|                      00199|                Rome|
|                  HR-10 040|              Zagreb|
+---------------------------+--------------------+
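If you prefer to skip the intermediate step, the fields can also be extracted and renamed in a single select. Here is a minimal sketch, assuming the same df as above:

# Read the field names from the struct column's schema.
field_names = df.schema['to_be_flattened'].dataType.fieldNames()

# Extract and rename every field in one pass.
df.select([
    col('to_be_flattened.' + c).alias('to_be_flattened_' + c)
    for c in field_names
])

Both versions produce the DataFrame shown above.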