Use regexp_replace to replace a matched string with a value of another column in PySpark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (44/100)

When we look at the documentation of regexp_replace, we see that it accepts three parameters:

  • the name of the column
  • the regular expression
  • the replacement text

Unfortunately, we cannot specify the column name as the third parameter and use the column value as the replacement.

If I have the following DataFrame and use the regex_replace function to substitute the numbers with the content of the b_column:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
from pyspark.sql import DataFrame

from pyspark.sql.types import *
from pyspark.sql import functions as F

df_schema = StructType([StructField('a_column', StringType()), StructField('b_column', StringType())])

test_list = [
    ['aaa123', 'aaa'],
    ['bbb243', 'bbb']
]

df = spark_session.createDataFrame(test_list, schema=df_schema)

df \
    .withColumn('replaced', F.regexp_replace('a_column', '\d{3}', 'b_column')) \
    .show()

The resulting DataFrame is not even a little bit similar to the expected output:

1
2
3
4
5
6
+--------+--------+-----------+
|a_column|b_column|   replaced|
+--------+--------+-----------+
|  aaa123|     aaa|aaab_column|
|  bbb243|     bbb|bbbb_column|
+--------+--------+-----------+

The first solution that comes to mind is using the col function as the third parameter:

1
.withColumn('replaced', F.regexp_replace('a_column', '\d{3}', F.col('b_column'))) \

This attempt fails too because we get TypeError: Column is not iterable error.

To solve the problem, we have to use the expr function and define the operation as a string:

1
.withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)"))



Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group