Use regexp_replace to replace a matched string with a value of another column in PySpark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (44/100)

When we look at the documentation of regexp_replace, we see that it accepts three parameters:

  • the name of the column
  • the regular expression
  • the replacement text

Unfortunately, we cannot specify the column name as the third parameter and use the column value as the replacement.

If I have the following DataFrame and use the regex_replace function to substitute the numbers with the content of the b_column:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
from pyspark.sql import DataFrame

from pyspark.sql.types import *
from pyspark.sql import functions as F

df_schema = StructType([StructField('a_column', StringType()), StructField('b_column', StringType())])

test_list = [
    ['aaa123', 'aaa'],
    ['bbb243', 'bbb']
]

df = spark_session.createDataFrame(test_list, schema=df_schema)

df \
    .withColumn('replaced', F.regexp_replace('a_column', '\d{3}', 'b_column')) \
    .show()

The resulting DataFrame is not even a little bit similar to the expected output:

1
2
3
4
5
6
+--------+--------+-----------+
|a_column|b_column|   replaced|
+--------+--------+-----------+
|  aaa123|     aaa|aaab_column|
|  bbb243|     bbb|bbbb_column|
+--------+--------+-----------+

The first solution that comes to mind is using the col function as the third parameter:

1
.withColumn('replaced', F.regexp_replace('a_column', '\d{3}', F.col('b_column'))) \

This attempt fails too because we get TypeError: Column is not iterable error.

To solve the problem, we have to use the expr function and define the operation as a string:

1
.withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)"))

Subscribe to the newsletter and join the free email course.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.


Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.

Do you want to work with me at riskmethods?

REMOTE position (available in Poland or Germany)