How to get names of columns with missing values in PySpark
When validating data in PySpark, we often need, for each row, the names of the columns that contain null values. In this article, I show how to get those names for every row in the DataFrame.
First, I assume that we have a DataFrame `df` and a list `all_columns` that contains the names of the columns we want to validate.
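To make the steps concrete, here is a minimal, hypothetical setup; the column names and values are my own example, not from the original article:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A toy DataFrame with some missing values (hypothetical example data)
df = spark.createDataFrame(
    [(1, None, "a"), (2, "x", None), (3, "y", "b")],
    ["id", "first_name", "last_name"],
)

# The columns we want to validate
all_columns = ["first_name", "last_name"]
```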
We need to create a column containing an array of strings: the names of the columns with null values. To do that, we use the `when` function to check whether each value is null and, if it is, return the column name as a literal. The `*` unpacks the Python list produced by the list comprehension into the arguments of Spark's `array` function:
```python
from pyspark.sql.functions import array, col, lit, when

missing_column_names = array(*[
    when(col(c).isNull(), lit(c)) for c in all_columns
])
```
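Note that `when` without a matching `otherwise` returns null, so the array will contain a null entry for every column that is not missing. If you want the array to hold only the missing column names, you can drop the nulls first; a sketch, assuming Spark 3.1+ for the higher-order `filter` function:

```python
from pyspark.sql.functions import filter as array_filter

# Keep only the non-null entries, i.e. the names of the missing columns
missing_column_names = array_filter(
    missing_column_names, lambda x: x.isNotNull()
)
```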
After that, we assign the values to a new column in the DataFrame:
```python
df = df.withColumn("missing_columns", missing_column_names)
```
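With the toy DataFrame sketched above, each row now carries the names of its missing columns (plus a null placeholder for every column that is present, unless you filtered the nulls out as shown earlier):

```python
df.select("id", "missing_columns").show(truncate=False)
# Logically, the rows contain:
#   id=1 -> [first_name, null]  (first_name is missing)
#   id=2 -> [null, last_name]   (last_name is missing)
#   id=3 -> [null, null]        (nothing is missing)
```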