How to concatenate columns in a PySpark DataFrame

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (8/100)

In the previous article, I described how to split a single column into multiple columns. In this one, I will show you how to do the opposite and merge multiple columns into one column.

Suppose that I have the following DataFrame, and I would like to create a column that contains the values from both of those columns with a single space in between:

1
2
3
4
5
6
7
8
9
+--------+--------+
|column_A|column_B|
+--------+--------+
| val_A_1| val_A_2|
| val_B_1| val_B_2|
|    null| val_B_3|
| val_A_4|    null|
|    null|    null|
+--------+--------+

In Spark, we can do that in two ways that give us a slightly different result.

concat

The first method is to use the concat function:

1
df.withColumn('joined', concat(col('column_A'), lit(' '), col('column_B')))
1
2
3
4
5
6
7
8
9
+--------+--------+---------------+
|column_A|column_B|         joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
|    null| val_B_3|           null|
| val_A_4|    null|           null|
|    null|    null|           null|
+--------+--------+---------------+


concat_ws

We see that if any of the values is null, we will get null as a result. What if we prefer to ignore the null values and concatenate the remaining columns? Of course, we could use the nvl function to replace nulls with empty strings or the when function to build conditional expressions, but there is an easier method.

To eliminate the null values without breaking the concatenation, we can use the concat_ws function. That function works a little bit differently than the concat. As the first parameter, we must pass the separator that it will put between all of the columns. After that, we specify the columns to merge:

1
df.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B')))

As a result, we get the following DataFrame:

1
2
3
4
5
6
7
8
9
+--------+--------+---------------+
|column_A|column_B|         joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
|    null| val_B_3|        val_B_3|
| val_A_4|    null|        val_A_4|
|    null|    null|               |
+--------+--------+---------------+

Note that the last row contains an empty string, not a string with a single space inside! I can prove that using the length function:

1
2
3
df \
        .withColumn('joined', concat_ws(' ', col('column_A'), col('column_B'))) \
        .withColumn('length', length(col('joined')))
1
2
3
4
5
6
7
8
9
+--------+--------+---------------+------+
|column_A|column_B|         joined|length|
+--------+--------+---------------+------+
| val_A_1| val_A_2|val_A_1 val_A_2|    15|
| val_B_1| val_B_2|val_B_1 val_B_2|    15|
|    null| val_B_3|        val_B_3|     7|
| val_A_4|    null|        val_A_4|     7|
|    null|    null|               |     0|
+--------+--------+---------------+------+

The Wrong Way

To show that using the when function is a terrible idea, let’s try to achieve the same outcome as the concat_ws function.

First, I have to check whether a column is null and return an empty string in such a case:

1
2
when(col('column_A').isNotNull(), col('column_A')).otherwise(lit(''))
when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))

In addition to all of that code, I have to check whether both columns are not null. If any of them is null, I will have to use an empty string instead of space as the second argument:

1
when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit(''))

The full code looks like this:

1
2
3
4
5
6
df \
    .withColumn('null_to_blank', concat(
        when(col('column_A').isNotNull(), col('column_A')).otherwise(lit('')),
        when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit('')),
        when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
    ))

Do not concatenate columns like this. Use concat_ws instead.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group