How to concatenate columns in a PySpark DataFrame

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (8/100)

In the previous article, I described how to split a single column into multiple columns. In this one, I will show you how to do the opposite and merge multiple columns into one column.

Suppose that I have the following DataFrame, and I would like to create a column that contains the values from both of those columns with a single space in between:

1
2
3
4
5
6
7
8
9
+--------+--------+
|column_A|column_B|
+--------+--------+
| val_A_1| val_A_2|
| val_B_1| val_B_2|
|    null| val_B_3|
| val_A_4|    null|
|    null|    null|
+--------+--------+

In Spark, we can do that in two ways that give us a slightly different result.

concat

The first method is to use the concat function:

1
df.withColumn('joined', concat(col('column_A'), lit(' '), col('column_B')))
1
2
3
4
5
6
7
8
9
+--------+--------+---------------+
|column_A|column_B|         joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
|    null| val_B_3|           null|
| val_A_4|    null|           null|
|    null|    null|           null|
+--------+--------+---------------+

concat_ws

We see that if any of the values is null, we will get null as a result. What if we prefer to ignore the null values and concatenate the remaining columns? Of course, we could use the nvl function to replace nulls with empty strings or the when function to build conditional expressions, but there is an easier method.

To eliminate the null values without breaking the concatenation, we can use the concat_ws function. That function works a little bit differently than the concat. As the first parameter, we must pass the separator that it will put between all of the columns. After that, we specify the columns to merge:

1
df.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B')))

As a result, we get the following DataFrame:

1
2
3
4
5
6
7
8
9
+--------+--------+---------------+
|column_A|column_B|         joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
|    null| val_B_3|        val_B_3|
| val_A_4|    null|        val_A_4|
|    null|    null|               |
+--------+--------+---------------+

Note that the last row contains an empty string, not a string with a single space inside! I can prove that using the length function:

1
2
3
df \
        .withColumn('joined', concat_ws(' ', col('column_A'), col('column_B'))) \
        .withColumn('length', length(col('joined')))
1
2
3
4
5
6
7
8
9
+--------+--------+---------------+------+
|column_A|column_B|         joined|length|
+--------+--------+---------------+------+
| val_A_1| val_A_2|val_A_1 val_A_2|    15|
| val_B_1| val_B_2|val_B_1 val_B_2|    15|
|    null| val_B_3|        val_B_3|     7|
| val_A_4|    null|        val_A_4|     7|
|    null|    null|               |     0|
+--------+--------+---------------+------+

The Wrong Way

To show that using the when function is a terrible idea, let’s try to achieve the same outcome as the concat_ws function.

First, I have to check whether a column is null and return an empty string in such a case:

1
2
when(col('column_A').isNotNull(), col('column_A')).otherwise(lit(''))
when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))

In addition to all of that code, I have to check whether both columns are not null. If any of them is null, I will have to use an empty string instead of space as the second argument:

1
when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit(''))

The full code looks like this:

1
2
3
4
5
6
df \
    .withColumn('null_to_blank', concat(
        when(col('column_A').isNotNull(), col('column_A')).otherwise(lit('')),
        when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit('')),
        when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
    ))

Do not concatenate columns like this. Use concat_ws instead.

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.