How to deal with days of the week in machine learning
What should we do when we train a machine learning model, and time is one of the features? In this article, I will show you a few options and tell you why only some of them make sense ;)
I am going to use a DataFrame with two features. The first one is a date and the second one is the number of people who visited this blog on a given day.
date      visitors
3/17/21   842
3/18/21   914
3/19/21   956
3/20/21   361
3/21/21   410
Before I start presenting the options to handle the day of the week, I will convert the date column into datetime and derive an additional column with a number indicating the day of the week.
    import pandas as pd

    # Parse the date strings, then extract the weekday (0 = Monday, 6 = Sunday)
    data['date'] = pd.to_datetime(data.date)
    data['day_of_week'] = data.date.dt.weekday
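To make the snippets in this article reproducible, the five sample rows shown earlier can be typed into a DataFrame by hand. This is only a minimal sketch. The explicit `format` string is my assumption about how the dates are written (month/day/two-digit year).

```python
import pandas as pd

# The five sample rows from the article, typed in by hand.
data = pd.DataFrame({
    'date': ['3/17/21', '3/18/21', '3/19/21', '3/20/21', '3/21/21'],
    'visitors': [842, 914, 956, 361, 410],
})

# Assumed format: month/day/two-digit year. Being explicit avoids
# any ambiguity between month-first and day-first dates.
data['date'] = pd.to_datetime(data['date'], format='%m/%d/%y')
data['day_of_week'] = data['date'].dt.weekday  # 0 = Monday, 6 = Sunday

print(data)
```

March 17, 2021 was a Wednesday, so the sample covers Wednesday (2) through Sunday (6).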
Turn the day of the week into a boolean value “is_weekend”
Before we start training the model, we do data exploration, and we may notice a difference between values during weekdays and weekends. In such a situation, we may convert the day of week feature into a boolean value to indicate whether a given day was Saturday or Sunday.
    data['is_weekend'] = data['day_of_week'].isin([5, 6])
This solution may be good enough when weekday data differs from weekend data, but the values on Saturday and Sunday are similar to each other.
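One way to check whether the weekend flag is worth adding is to compare the average number of visitors in each group. The sketch below uses only the five sample rows from this article, so the numbers are illustrative rather than a real analysis.

```python
import pandas as pd

# The five sample rows from the article (dates written in ISO format here).
data = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-17', '2021-03-18', '2021-03-19',
                            '2021-03-20', '2021-03-21']),
    'visitors': [842, 914, 956, 361, 410],
})
data['day_of_week'] = data['date'].dt.weekday
data['is_weekend'] = data['day_of_week'].isin([5, 6])  # 5 = Saturday, 6 = Sunday

# Average visitors on weekdays and on weekends
weekday_mean = data.loc[~data['is_weekend'], 'visitors'].mean()
weekend_mean = data.loc[data['is_weekend'], 'visitors'].mean()

print(f'weekdays: {weekday_mean:.1f}, weekends: {weekend_mean:.1f}')
```

A large gap between the two averages suggests that the boolean feature carries useful signal.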
One-hot encoding
We can use one-hot encoding to produce a boolean feature for every day of the week. Such a solution tells the model which day it is, but it discards the relations between the days. Effectively, we decide that the order of days no longer matters. Is that the case? Not always. Usually, we should not use one-hot encoding to encode days of the week.
    day_of_week_columns = pd.get_dummies(data['day_of_week'])
    # merge returns a new DataFrame, so keep the result
    data = data.merge(day_of_week_columns, left_index=True, right_index=True)
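One thing to watch out for: `pd.get_dummies` creates columns only for the values that occur in the data, so a sample without Mondays produces no Monday column. A possible workaround, sketched below, is to declare all seven categories first. The `day` prefix and the variable names are my own choices for this illustration.

```python
import pandas as pd

data = pd.DataFrame({'day_of_week': [2, 3, 4, 5, 6]})  # Wednesday to Sunday

# Declaring all seven categories makes get_dummies emit seven columns,
# even for days that never occur in this sample.
all_days = pd.CategoricalDtype(categories=list(range(7)))
days = data['day_of_week'].astype(all_days)
day_of_week_columns = pd.get_dummies(days, prefix='day')

# join aligns on the index, like merge with left_index/right_index
encoded = data.join(day_of_week_columns)
print(encoded.columns.tolist())
```

The string prefix also keeps all column names the same type, which some libraries expect.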
Encoding day of the week as a number
We have already done it in this example. The day_of_week column contains values between 0 and 6 that denote the day of the week.
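It is not the right way to encode days of the week if we want to use the data to train machine learning models! In reality, Saturday is closer to Monday (two days apart) than to Wednesday (three days apart), yet the numeric encoding puts Saturday (5) five units away from Monday (0) and only three units away from Wednesday (2). Encoding days of the week as numbers distorts the meaning of the data.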
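The mismatch is easy to see when we compare the distance a model sees with the distance on a calendar. Both helper functions below are hypothetical names introduced only for this illustration.

```python
def numeric_distance(a: int, b: int) -> int:
    """Distance between two days when they are treated as plain numbers."""
    return abs(a - b)


def circular_distance(a: int, b: int, period: int = 7) -> int:
    """Number of days between a and b, going around the week the short way."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


MONDAY, WEDNESDAY, SATURDAY, SUNDAY = 0, 2, 5, 6

for name, a, b in [('Saturday-Monday', SATURDAY, MONDAY),
                   ('Saturday-Wednesday', SATURDAY, WEDNESDAY),
                   ('Sunday-Monday', SUNDAY, MONDAY)]:
    print(name, 'numeric:', numeric_distance(a, b),
          'circular:', circular_distance(a, b))
```

For Sunday and Monday the numeric distance is 6, while the two days are only one day apart on the calendar.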
Encoding day of the week as points on a circle
We don’t want to lose the information about the circular nature of weeks and the actual distance between the days. Therefore, we can encode the day of week feature as “points” on a circle, spaced 360° / 7 ≈ 51.4° apart: 0° = Monday, 51.4° = Tuesday, etc.
There is one problem. We know that it is a circle, but for a machine learning model, the difference between Sunday (308.6°) and Monday (0°) is 308.6° instead of 51.4°. That is wrong.
To solve the problem, we have to calculate the sine and cosine of the angle. We need both because each function alone produces the same output for different inputs, but when we use them together, we get a unique pair of values for every day:
    import numpy as np

    data['day_of_week_sin'] = np.sin(data['day_of_week'] * (2 * np.pi / 7))
    data['day_of_week_cos'] = np.cos(data['day_of_week'] * (2 * np.pi / 7))
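A quick sanity check of this encoding: on the unit circle, every pair of neighbouring days should be the same distance apart, including Sunday and Monday, and no two days should share the same (sin, cos) pair. The sketch below uses NumPy only. `encode_day` is a helper name I made up for this illustration.

```python
import numpy as np


def encode_day(day_of_week):
    """Map weekday numbers (0 = Monday ... 6 = Sunday) to points on the unit circle."""
    angle = np.asarray(day_of_week) * (2 * np.pi / 7)
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


points = encode_day(np.arange(7))

# Distance between each day and the next one, wrapping from Sunday to Monday
next_points = np.roll(points, -1, axis=0)
neighbour_distances = np.linalg.norm(points - next_points, axis=1)

print(neighbour_distances.round(4))
print('unique pairs:', len(np.unique(points.round(8), axis=0)))
```

All seven distances come out equal, so the gap between Sunday and Monday is no longer larger than the gap between Monday and Tuesday.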
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz