How to get the value by rank from a grouped Pandas dataframe
Let’s say that we have a data frame containing all purchases done by all our customers. We want to get the most recent purchase of every client. How do we do it?
We may use the grouping function to get the purchases of every customer and then get the most recent one from every group.
Imagine that the “group_name” column contains the identifier of the customer. The “used_for_sorting” column is our date of purchase, so we want the largest value in the group.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 import pandas as pd data = pd.DataFrame([ ['a', 1, 'A'], ['a', 2, 'B'], ['a', 3, 'C'], ['b', 5, 'D'], ['b', 6, 'E'], ['b', 7, 'F'], ['b', 8, 'G'], ['c', 10, 'H'], ['c', 11, 'I'], ['c', 12, 'J'], ['c', 13, 'K'] ], columns = ['group_name', 'used_for_sorting', 'the_value'])
Get the first value from a group
In Pandas such a solution looks like that.
First, I have to sort the data frame by the “used_for_sorting” column.
I must do it before I start grouping because sorting of a grouped data frame is not supported and the groupby function does not sort the value within the groups, but it preserves the order of rows.
1 sorted_data_frame = data.sort_values(['used_for_sorting'], ascending=False)
Now, I can group the data frame by the customer identifier. In my case, the “group_name” is the customer identifier.
1 grouped_data_frame = sorted_data_frame.groupby('group_name')
After that, I must get the first value from every group. There are two options. I can either use the first function or the nth function with parameter 1. Obviously, I am going to choose the first function.
If you don’t want to use the group name as the index, remember to drop it.
All in one line, it looks like this:
1 2 3 4 5 data \ .sort_values(['used_for_sorting'], ascending = False) \ .groupby('group_name') \ .first() \ .reset_index()
1 2 3 4 group_name used_for_sorting the_value 0 a 3 C 1 b 8 G 2 c 13 K
Return the rank
What if instead of returning one row I want to get all of the rows with their rank?
In this case, I have to:
group the data frame by the group_name column
get the series (column) from the grouped data frame
calculate the rank of the series
add the rank as a new column in the original data frame
1 2 3 4 5 ranks = data \ .groupby('group_name')['used_for_sorting'] \ .rank(ascending = True, method = 'first') ranks.name = 'rank' pd.concat([data, ranks], axis = 1)
1 2 3 4 5 6 7 8 9 10 11 12 group_name used_for_sorting the_value rank 0 a 1 A 1.0 1 a 2 B 2.0 2 a 3 C 3.0 3 b 5 D 1.0 4 b 6 E 2.0 5 b 7 F 3.0 6 b 8 G 4.0 7 c 10 H 1.0 8 c 11 I 2.0 9 c 12 J 3.0 10 c 13 K 4.0
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- How to load data from Google Drive to Pandas running in Google Colaboratory
- How to display mathematical equations in Jupyter Notebook
- How to turn Pandas data frame into time-series input for RNN
- How to display all columns of a Pandas DataFrame in Jupyter Notebook
- Looking for structure in data — Andrews curves plot explained
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz