How to get names of columns with missing values in PySpark

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (73/100)

When we do data validation in PySpark, it is common to need all columns’ column names with null values. In this article, I show how to get those names for every row in the DataFrame.

First, I assume that we have a DataFrame df and an array all_columns, which contains the names of the columns we want to validate.

We have to create a column containing an array of strings that denote the column names with null values. Therefore, we have to use the when function to check whether the value is null and pass the column names as the literal value. We use the * to unpack the array produced by for comprehension into a Spark array:

missing_column_names = array(*[
    when(col(c).isNull(),lit(c)) for c in all_column

After that, we assign the values to a new column in the DataFrame:

df = df.withColumn("missing_columns", missing_column_names)

Subscribe to the newsletter and join the free email course.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.