How to sort a Pandas DataFrame by month name

The alphabetical order makes no sense when we sort a Pandas DataFrame by month names. How can we sort the values by months while keeping the proper order and still having month names in the column?

The simplest option may be to add a column with the month position in the calendar and sort by it. However, to me, it looks like an ugly hack. There must be a better way.

First, let’s create a Pandas DataFrame and mix the order of months:

import pandas as pd
df = pd.DataFrame({
    'date': ['2022-08-01', '2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01'],
    'value': [123, 456, 789, 345, 678]
})

df['month'] = pd.to_datetime(df['date']).apply(lambda x: x.month_name())
df = df.sort_values('value')

Now, the df variable contains out-of-order variables:

| date       |   value | month     |
|:-----------|--------:|:----------|
| 2022-08-01 |     123 | August    |
| 2022-11-01 |     345 | November  |
| 2022-09-01 |     456 | September |
| 2022-12-01 |     678 | December  |
| 2022-10-01 |     789 | October   |

Let’s fix it.

The wrong way - putting months in the alphabetical order

To verify that we have a problem sorting the months, let’s try using the sort_values method on the month column. It will sort them alphabetically by names. That’s probably the worst way to display time-based data.

df.sort_values('month')

The result is a disaster:

| date       |   value | month     |
|:-----------|--------:|:----------|
| 2022-08-01 |     123 | August    |
| 2022-12-01 |     678 | December  |
| 2022-11-01 |     345 | November  |
| 2022-10-01 |     789 | October   |
| 2022-09-01 |     456 | September |

Sorting months by names in Pandas

To sort the months correctly, we must define the order first. We could create a list of months in the correct order by hand, but using the date_range function seems easier:

dates_in_order = pd.date_range(start='2022-01-01', end='2022-12-01', freq='MS')

As a result, we have a DataFrameIndex with the first days of every month in 2022:

DatetimeIndex(['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
               '2022-05-01', '2022-06-01', '2022-07-01', '2022-08-01',
               '2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01'],
              dtype='datetime64[ns]', freq='MS')

Now, we can extract the month name from each date:

months_in_order = dates_in_order.map(lambda x: x.month_name()).to_list()

We have created a list of months:

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

Now, let’s use the list of month names to define the month column as an ordered categorical variable:

df.month = pd.Categorical(
    df.month,
    categories=months_in_order,
    ordered=True
)

Our DataFrame is still not sorted correctly, but this time when we use the sort_values function, we will get the expected result:

df.sort_values('month')
| date       |   value | month     |
|:-----------|--------:|:----------|
| 2022-08-01 |     123 | August    |
| 2022-09-01 |     456 | September |
| 2022-10-01 |     789 | October   |
| 2022-11-01 |     345 | November  |
| 2022-12-01 |     678 | December  |
Older post

How to become a data engineer for free

What do you need to know to become a data engineer? Does a data engineer need a degree? How can you get your first data engineering job?

Newer post

Software engineering practices in data engineering and data science

How to produce high-quality software in data teams by applying software engineering practices to data science and data engineering