How to return rows with missing values in Pandas DataFrame

The task is easy. I have a DataFrame which has missing values, but I don’t know where they are. I want to get a DataFrame which contains only the rows with at least one missing values.

If I look for the solution, I will most likely find this:

1
data[data.isnull().T.any().T]

It gets the job done, and it returns the correct result, but there is a better solution. Before I describe the better way, let’s look at the steps done by the popular method.

First, it calls the “isnull” function. As a result, I get a DataFrame of booleans. Every value tells me whether the value in this cell is undefined.

1
data.isnull()

What is T? It is the transpose operations. This operations “flips” the DataFrame over its diagonal. Columns become rows, and rows turn into columns.

1
data.isnull().T

After that, it calls the “any” function which returns True if at least one value in the row is True.

1
data.isnull().T.any()

That operation returns an array of boolean values — one boolean per row of the original DataFrame.

Are you interested in data engineering?

Check out my other blog https://easydata.engineering

As the last step, it transposes the result. That last operation does not do anything useful. It is redundant. If we look at the values and the shape of the result after calling only “data.isnull().T.any()” and the full predicate “data.isnull().T.any().T”, we see no difference. That is the first problem with that solution.

1
2
print ((data.isnull().T.any() == data.isnull().T.any().T).all())
print ((data.isnull().T.any().shape == data.isnull().T.any().T.shape))

Finally, the array of booleans is passed to the DataFrame as a column selector. If a position of the array contains True, the row corresponding row will be returned.

1
data[data.isnull().T.any().T]

Now, we see that the favored solution performs one redundant operation.In fact, there are two such operations.

If I use the axis parameter of the “any” function, I can tell it to check whether there is a True value in the row.

Because of that I can get rid of the second transposition and make the code simpler, faster and easier to read:

1
data[data.isnull().any(axis = 1)]

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you watch programming live streams, check out my YouTube channel.
You can also follow me on Twitter: @mikulskibartosz

If you want to hire me, send me a message on LinkedIn or Twitter.


Bartosz Mikulski
Bartosz Mikulski * data scientist / software/data engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group