Use regexp_replace to replace a matched string with a value of another column in PySpark

When we look at the documentation of regexp_replace, we see that it accepts three parameters:

  • the name of the column
  • the regular expression
  • the replacement text

Unfortunately, we cannot specify the column name as the third parameter and use the column value as the replacement.
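For reference, here is a minimal sketch of the intended, literal-replacement use of the function (the session, data, and pattern below are only illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame used only to demonstrate the literal replacement
demo = spark.createDataFrame([('aaa123',)], ['a_column'])

# The third parameter is a literal string, so every three-digit match becomes 'XXX'
demo.withColumn('masked', F.regexp_replace('a_column', r'\d{3}', 'XXX')).show()

# +--------+------+
# |a_column|masked|
# +--------+------+
# |  aaa123|aaaXXX|
# +--------+------+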

Suppose I have the following DataFrame and use the regexp_replace function to substitute the numbers with the value of b_column:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

spark_session = SparkSession.builder.getOrCreate()

df_schema = StructType([StructField('a_column', StringType()), StructField('b_column', StringType())])

test_list = [
    ['aaa123', 'aaa'],
    ['bbb243', 'bbb']
]

df = spark_session.createDataFrame(test_list, schema=df_schema)

df \
    .withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', 'b_column')) \
    .show()

The resulting DataFrame is nowhere near the expected output because the third parameter is treated as a literal replacement string, not as a column reference:

+--------+--------+-----------+
|a_column|b_column|   replaced|
+--------+--------+-----------+
|  aaa123|     aaa|aaab_column|
|  bbb243|     bbb|bbbb_column|
+--------+--------+-----------+

The first solution that comes to mind is using the col function as the third parameter:

.withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', F.col('b_column'))) \

This attempt fails too, with a TypeError: Column is not iterable.

To solve the problem, we have to use the expr function and define the operation as a string:

.withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)"))
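Putting it all together (a sketch assuming the same spark_session and DataFrame as above), the full expression and its expected output look like this:

df \
    .withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)")) \
    .show()

# +--------+--------+--------+
# |a_column|b_column|replaced|
# +--------+--------+--------+
# |  aaa123|     aaa|  aaaaaa|
# |  bbb243|     bbb|  bbbbbb|
# +--------+--------+--------+

Because expr parses the whole expression as Spark SQL, b_column is resolved as a column reference, and its value is used as the replacement for each row.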