When we look at the documentation of regexp_replace, we see that it accepts three parameters:

  • the name of the column
  • the regular expression
  • the replacement text
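
Used with a literal replacement, the function behaves exactly as documented. Here is a minimal sketch of that baseline (assuming a running SparkSession; the session name and the example values are mine):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark_session = SparkSession.builder.getOrCreate()

literal_df = spark_session.createDataFrame([['aaa123']], schema='a_column string')

# replaces three consecutive digits with the literal text 'X'
literal_df.withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', 'X')).show()

# +--------+--------+
# |a_column|replaced|
# +--------+--------+
# |  aaa123|    aaaX|
# +--------+--------+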

Unfortunately, we cannot pass a column name as the third parameter and have that column's values used as the replacement. The replacement argument is treated as a literal string.

Suppose I have the following DataFrame and use the regexp_replace function to substitute the numbers with the content of b_column:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

spark_session = SparkSession.builder.getOrCreate()

df_schema = StructType([
    StructField('a_column', StringType()),
    StructField('b_column', StringType())
])

test_list = [
    ['aaa123', 'aaa'],
    ['bbb243', 'bbb']
]

df = spark_session.createDataFrame(test_list, schema=df_schema)

df \
    .withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', 'b_column')) \
    .show()

The resulting DataFrame is nothing like the expected output. The third argument is interpreted as a literal string, so the digits are replaced with the text b_column:

+--------+--------+-----------+
|a_column|b_column|   replaced|
+--------+--------+-----------+
|  aaa123|     aaa|aaab_column|
|  bbb243|     bbb|bbbb_column|
+--------+--------+-----------+

The first solution that comes to mind is using the col function as the third parameter:

.withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', F.col('b_column'))) \

This attempt fails too, raising TypeError: Column is not iterable.

To solve the problem, we have to use the expr function and define the whole operation as a SQL expression string:

.withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)"))
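
Putting it all together, here is a short sketch of the workaround, reusing the df and F defined above (the output shown is what Spark should print for this input):

# expr parses its argument as SQL, so b_column is resolved as a
# column reference instead of being treated as a literal string
df \
    .withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)")) \
    .show()

# +--------+--------+--------+
# |a_column|b_column|replaced|
# +--------+--------+--------+
# |  aaa123|     aaa|  aaaaaa|
# |  bbb243|     bbb|  bbbbbb|
# +--------+--------+--------+

Note that inside expr, b_column is an unquoted identifier, so Spark's SQL parser resolves it against the DataFrame's columns; wrapping it in single quotes would turn it back into a literal string.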