Selecting rows in Pandas

Pandas offers several ways to select rows from a DataFrame. Let’s take a look at the four most popular ones.

We will start with a DataFrame containing five rows:

  col_A col_B
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
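The original does not show how the DataFrame was created. A minimal sketch that would produce the same table, assuming the variable name data used in the later examples and the default numeric index:

```python
import pandas as pd

# Build the five-row example DataFrame; pandas assigns the default index 0..4.
data = pd.DataFrame({
    "col_A": [1, 2, 3, 4, 5],
    "col_B": ["A", "B", "C", "D", "E"],
})

print(data)
```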

The loc indexer

First, we will use loc. Although it is often called a function, loc is an indexer: we use it with square brackets, and it selects rows by their index labels. For example, data.loc[[0,1,4]] returns the rows labeled 0, 1, and 4, which in our DataFrame are the first, the second, and the last row.

  col_A col_B
0 1 A
1 2 B
4 5 E

Of course, it’s difficult to spot the benefit of using the loc function when we have a numeric index. Because of that, we will set the col_B column as the index and use its values to select the rows:

data.set_index('col_B').loc[['A', 'B', 'E']]
col_B col_A
A 1
B 2
E 5
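The two loc calls above can be put together in one runnable sketch. The DataFrame construction and the variable names (indexed, by_label, label_slice) are assumptions made for illustration. The last line goes one step beyond the text: loc also accepts a label slice, and that slice includes both endpoints.

```python
import pandas as pd

data = pd.DataFrame({
    "col_A": [1, 2, 3, 4, 5],
    "col_B": ["A", "B", "C", "D", "E"],
})

# With the default index, the labels happen to equal the row positions.
by_default_label = data.loc[[0, 1, 4]]

# After set_index, the labels are the former col_B values.
indexed = data.set_index("col_B")
by_label = indexed.loc[["A", "B", "E"]]

# A label slice includes both endpoints, unlike a Python list slice.
label_slice = indexed.loc["B":"D"]

print(by_label)
print(label_slice)
```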

The iloc indexer

Where loc selects rows by label, iloc selects them by their integer position in the DataFrame, regardless of what the index contains. With the default numeric index, the two give the same result. Let’s retrieve the last two rows:

data.iloc[[3,4]]
  col_A col_B
3 4 D
4 5 E
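The difference between position and label only shows up once the two stop matching. The sketch below is an illustration, not part of the original example: it reorders the rows so that the index labels no longer equal the positions, then compares iloc with loc. The variable names are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "col_A": [1, 2, 3, 4, 5],
    "col_B": ["A", "B", "C", "D", "E"],
})

# Reverse the row order; every row keeps its original index label.
shuffled = data.sort_values("col_A", ascending=False)

by_position = shuffled.iloc[[3, 4]]  # fourth and fifth rows in the current order
by_label = shuffled.loc[[3, 4]]      # rows whose index labels are 3 and 4

# A negative slice selects the last two rows without knowing the length.
last_two = data.iloc[-2:]

print(by_position)
print(by_label)
```

Here by_position holds the rows with col_B values B and A, while by_label holds D and E.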

Using a binary mask

In Pandas, we can pass a Boolean array (often called a binary mask) to the DataFrame selector to retrieve the rows at the positions where the array is True.

We need an array of bool values with the same length as our DataFrame.

binary = [True, False, True, True, False]
data[binary]
  col_A col_B
0 1 A
2 3 C
3 4 D

The most popular data selection method generates the Boolean array from a condition on the values in the DataFrame. The comparison produces one True or False value per row, and that result is used as the mask. For example, we can retrieve the rows in which col_A has values smaller than 3:

data[data['col_A'] < 3]
  col_A col_B
0 1 A
1 2 B
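To make the intermediate step visible, the sketch below stores the mask in a variable before using it. It also shows one extension that the text does not cover: combining two conditions with the & operator, where each condition needs its own parentheses. The names mask, small, and middle are chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "col_A": [1, 2, 3, 4, 5],
    "col_B": ["A", "B", "C", "D", "E"],
})

# The comparison returns a Boolean Series with one value per row.
mask = data["col_A"] < 3
print(mask.tolist())

# Using the mask keeps only the rows where it is True.
small = data[mask]

# Combine conditions with & (and) or | (or); parentheses are required
# because these operators bind more tightly than the comparisons.
middle = data[(data["col_A"] > 1) & (data["col_A"] < 5)]

print(small)
print(middle)
```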

Slicing a DataFrame

Finally, we can slice a DataFrame with the same start:stop:step syntax we use for Python lists. With integer bounds, the slice selects rows by position and excludes the stop position.

data[2:3]
  col_A col_B
2 3 C

data[:2]
  col_A col_B
0 1 A
1 2 B

data[1:]
  col_A col_B
1 2 B
2 3 C
3 4 D
4 5 E

data[::2]
  col_A col_B
0 1 A
2 3 C
4 5 E
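One way to check the comparison with Python lists is to apply the same slices to a DataFrame and to a plain list of its values. This is a small sketch under the same assumptions as before (the DataFrame construction and the helper name letters are not from the original):

```python
import pandas as pd

data = pd.DataFrame({
    "col_A": [1, 2, 3, 4, 5],
    "col_B": ["A", "B", "C", "D", "E"],
})

# A plain Python list holding the same values as col_B.
letters = data["col_B"].tolist()

# Each DataFrame slice selects the same elements as the list slice.
for rows in (slice(2, 3), slice(None, 2), slice(1, None), slice(None, None, 2)):
    print(letters[rows], data[rows]["col_B"].tolist())

# With integer bounds, data[start:stop] matches positional selection with iloc.
same_as_iloc = data[2:3].equals(data.iloc[2:3])
print(same_as_iloc)
```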