Predicting customer churn using the Pareto/NBD model

In this blog post, I am going to build a Pareto/NBD model to predict the number of customer visits in a given period. Such a model is the first part of predicting the customer lifetime value, but I am not going to use it for CLV prediction. I am going to focus on using it to predict customer churn. After all, if we see that the forecast for upcoming purchases is 0, we can send the customer a farewell message.

First, we are going to need a list of customer transactions, but not all of them! The model works best if we split customers into cohorts and the input dataset contains only one cohort of customers.

import pandas as pd
import lifetimes
data = pd.read_csv("data.csv", header = 0)
data['date'] = pd.to_datetime(data['date'])
First 5 rows from the dataset.

A cohort consists of all customers who made the first purchase in the same period. For example, we can get the transaction log of all customers who started buying our products in Q1 2018.
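
To make the cohort idea concrete, here is a minimal sketch that keeps only the customers whose first purchase falls into Q1 2018. It uses a made-up transaction log and plain Python so that it runs without the CSV file; the select_cohort helper and the sample values are assumptions for illustration, not part of the original code. With pandas, the same idea would be a group-by on the customer column followed by taking the minimum date.

```python
from datetime import date

# Toy transaction log: (customer id, purchase date). The values are made up.
transactions = [
    (1, date(2018, 1, 15)),
    (1, date(2018, 3, 2)),
    (2, date(2017, 11, 30)),
    (2, date(2018, 2, 10)),
    (3, date(2018, 3, 31)),
    (4, date(2018, 4, 1)),
]


def select_cohort(transactions, start, end):
    """Keep transactions of customers whose first purchase falls in [start, end]."""
    first_purchase = {}
    for customer, day in transactions:
        if customer not in first_purchase or day < first_purchase[customer]:
            first_purchase[customer] = day
    members = {c for c, d in first_purchase.items() if start <= d <= end}
    return [(c, d) for c, d in transactions if c in members]


# Customer 2 started in 2017 and customer 4 in Q2 2018, so both are left out.
cohort_q1_2018 = select_cohort(transactions, date(2018, 1, 1), date(2018, 3, 31))
print(cohort_q1_2018)
```

Note that customer 2 is excluded even though they bought something in Q1 2018, because cohort membership depends only on the first purchase.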

Customer transactions

In the second step, we must calculate a summary of the customer’s transaction log. We need three numbers calculated separately for every customer in the cohort. Those three numbers are:

  • recency — the time between the first and the last transaction

  • frequency — the number of purchases beyond the initial one

  • T — the time between the first purchase and the end of the calibration period

summary = lifetimes.utils.summary_data_from_transaction_data(data, 'cust', 'date')
summary = summary.reset_index()

To calculate the summary values, we are going to use the summary_data_from_transaction_data function from the lifetimes library. Note that I did not specify the end of the observation period (the observation_period_end parameter, which plays the role of the calibration period end here), because I want to use all transactions from the input data frame. Also, the output is not going to be a standard RFM structure: I am not interested in the monetary value, so I left the monetary_value_col parameter out.
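
To see what the three numbers mean, the sketch below computes them by hand for a tiny, made-up transaction log. It is only an approximation of what the library does: it assumes daily periods, and it counts several purchases made on the same day as one, which is how I understand the lifetimes summary to behave. The summarize function and the sample data are hypothetical.

```python
from datetime import date


def summarize(transactions, observation_end):
    """Return {customer: (frequency, recency, T)} with time measured in days."""
    days_by_customer = {}
    for customer, day in transactions:
        if day <= observation_end:
            # A set keeps one entry per day, so same-day purchases count once.
            days_by_customer.setdefault(customer, set()).add(day)

    result = {}
    for customer, days in days_by_customer.items():
        first, last = min(days), max(days)
        frequency = len(days) - 1            # purchase days beyond the first one
        recency = (last - first).days        # 0 for customers with one purchase
        T = (observation_end - first).days   # customer "age" at the end of observation
        result[customer] = (frequency, recency, T)
    return result


log = [
    ("a", date(2018, 1, 1)),
    ("a", date(2018, 1, 11)),
    ("a", date(2018, 1, 11)),  # second purchase on the same day
    ("a", date(2018, 2, 20)),
    ("b", date(2018, 1, 5)),
]

manual_summary = summarize(log, observation_end=date(2018, 3, 1))
print(manual_summary)
```

Customer "b" bought only once, so both frequency and recency are zero, and only T tells the model how long we have been observing that customer.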

In the “lifetimes” implementation, the model has a penalizer_coef hyperparameter. I am not going to specify it in the first example, but in the second part of the blog post, I will show you how to split the transaction log into the training/test datasets. That will allow us to tune the hyperparameters.

Now, we can fit the model.

model = lifetimes.ParetoNBDFitter()
model.fit(summary['frequency'], summary['recency'], summary['T'])

After fitting the model, we can start predicting whether the customer is going to make another purchase. I need the summary metrics generated from the transaction log and the conditional_probability_alive function.

customer_id = 4

individual = summary[summary['cust'] == customer_id]
model.conditional_probability_alive(individual['frequency'], individual['recency'], individual['T'])

I promised that the model is going to predict customer churn, so we are not interested in the probability of being alive (which in this case means: “the customer will buy something”), but in its complement:

1 - model.conditional_probability_alive(individual['frequency'], individual['recency'], individual['T'])
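
A probability alone does not tell us whom to contact, so one possible next step is to flag customers whose churn probability exceeds a cut-off. The sketch below assumes we have already collected the probabilities of being alive for several customers; the numbers, the 0.5 threshold, and the likely_churned helper are made up for illustration and are not prescribed by the model.

```python
# Hypothetical probabilities of being alive, keyed by customer id.
# In practice they would come from model.conditional_probability_alive.
p_alive = {1: 0.97, 2: 0.40, 3: 0.08, 4: 0.65}

CHURN_THRESHOLD = 0.5  # assumed cut-off, should be chosen for the business case


def likely_churned(p_alive, threshold):
    """Return ids whose churn probability (1 - P(alive)) exceeds the threshold."""
    churn = {customer: 1.0 - p for customer, p in p_alive.items()}
    flagged = [c for c, p in churn.items() if p > threshold]
    # Riskiest customers first.
    return sorted(flagged, key=lambda c: churn[c], reverse=True)


print(likely_churned(p_alive, CHURN_THRESHOLD))
```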

Cross-validation

The model we built in the first example may be incorrect. We cannot tell whether it gives us the correct values, because we have not tested it. To test the model, we must split the transaction log into the training and test (a.k.a. holdout) datasets. We are going to build the model using the training set and then test its correctness using the test dataset.

Fortunately, in the lifetimes library that can be done using just two functions. Isn’t it convenient?

First, we must split the data.

from lifetimes.utils import calibration_and_holdout_data

summary_cal_holdout = calibration_and_holdout_data(
    data,
    'cust',
    'date',
    calibration_period_end=pd.to_datetime('1997-12-31'),
    observation_period_end=data['date'].max()
)
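
The result contains the calibration columns (frequency_cal, recency_cal, T_cal) and, for the same customers, the number of purchases observed in the holdout period (frequency_holdout). The sketch below shows, under simplifying assumptions, how such a holdout count can be derived by hand: daily periods, same-day purchases counted once, and only customers who already bought something during calibration. The holdout_frequency function and the sample log are hypothetical.

```python
from datetime import date


def holdout_frequency(transactions, calibration_end, observation_end):
    """Count distinct purchase days per customer in (calibration_end, observation_end]."""
    # Only customers seen during calibration are kept; they start with no holdout days.
    days = {c: set() for c, d in transactions if d <= calibration_end}
    for customer, day in transactions:
        if customer in days and calibration_end < day <= observation_end:
            days[customer].add(day)
    return {customer: len(d) for customer, d in days.items()}


log = [
    ("a", date(1997, 3, 1)),
    ("a", date(1997, 11, 1)),
    ("a", date(1998, 2, 1)),
    ("a", date(1998, 3, 15)),
    ("b", date(1997, 6, 1)),
    ("c", date(1998, 1, 20)),  # first purchase after the calibration period
]

holdout_counts = holdout_frequency(log, date(1997, 12, 31), date(1998, 6, 30))
print(holdout_counts)
```

Customer "c" does not appear in the result, because the model never saw that customer during calibration and therefore cannot make a prediction for them.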

Now, we can build the model, this time setting its penalizer_coef hyperparameter.

model = lifetimes.ParetoNBDFitter(penalizer_coef=0.01)
model.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal'])

After that, we can forecast the customer purchases and compare the results with the real data from the holdout dataset.

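from lifetimes.utils import expected_cumulative_transactions

freq = 'D'  # daily periods
number_of_days = 14

# holdout_set_transactions is not defined earlier in this post; it is assumed to be
# a data frame of transactions with the same 'cust' and 'date' columns as data.
# freq is passed by keyword, because the sixth positional parameter is datetime_format.
expected_vs_actual = expected_cumulative_transactions(
    model, holdout_set_transactions, 'date', 'cust', number_of_days, freq=freq
)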

In machine learning, we assess the model performance by calculating a single number that summarizes the errors made by the model. I already have a dataset of actual and predicted values, so I can calculate the RMSE (root mean squared error):

from sklearn.metrics import mean_squared_error
from math import sqrt

sqrt(mean_squared_error(expected_vs_actual['actual'], expected_vs_actual['predicted']))
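
For readers who want to see what this number is made of, here is the same metric written out in plain Python. It is a sketch with made-up actual and predicted values, not output of the model; the rmse helper is defined only for this example.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error of two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up cumulative transaction counts and forecasts.
actual = [0, 2, 3, 5]
predicted = [1.0, 2.0, 4.0, 3.0]

# Squared errors are 1, 0, 1 and 4, so the mean is 1.5.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss (the last value here) weighs more than several small ones.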

Bartosz Mikulski
Bartosz Mikulski * data scientist / software engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group