How to return rows with missing values in Pandas DataFrame

The task is easy. I have a DataFrame which has missing values, but I don’t know where they are. I want to get a DataFrame which contains only the rows with at least one missing values.

If I look for the solution, I will most likely find this:

1
data[data.isnull().T.any().T]

It gets the job done, and it returns the correct result, but there is a better solution. Before I describe the better way, let’s look at the steps done by the popular method.

First, it calls the “isnull” function. As a result, I get a DataFrame of booleans. Every value tells me whether the value in this cell is undefined.

1
data.isnull()

What is T? It is the transpose operations. This operations “flips” the DataFrame over its diagonal. Columns become rows, and rows turn into columns.

1
data.isnull().T

After that, it calls the “any” function which returns True if at least one value in the row is True.

1
data.isnull().T.any()

That operation returns an array of boolean values — one boolean per row of the original DataFrame.

As the last step, it transposes the result. That last operation does not do anything useful. It is redundant. If we look at the values and the shape of the result after calling only “data.isnull().T.any()” and the full predicate “data.isnull().T.any().T”, we see no difference. That is the first problem with that solution.

1
2
print ((data.isnull().T.any() == data.isnull().T.any().T).all())
print ((data.isnull().T.any().shape == data.isnull().T.any().T.shape))

Finally, the array of booleans is passed to the DataFrame as a column selector. If a position of the array contains True, the row corresponding row will be returned.

1
data[data.isnull().T.any().T]

Now, we see that the favored solution performs one redundant operation.In fact, there are two such operations.

If I use the axis parameter of the “any” function, I can tell it to check whether there is a True value in the row.

Because of that I can get rid of the second transposition and make the code simpler, faster and easier to read:

1
data[data.isnull().any(axis = 1)]

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.