How to perform an A/B test correctly in Python

What can we expect from a correctly performed A/B test? It should detect a difference between the treatment and the control group only when the difference exists. We want to avoid two kinds of errors: false positives (reporting a difference that does not exist) and false negatives (missing a difference that does exist).

In this blog post, I am going to summarize the steps of a correctly done A/B test and show you examples of the calculations you need to do.

Define the key metric

It is the most critical step, but fortunately, it does not require any calculations. We must find a metric that tells us whether we achieved the desired outcome.

Let’s look at an example that is not IT related. Assume that I have a restaurant and I would like to sell more strawberry cakes because that is the only kind of dessert I offer. I took some pictures of the cake, but I cannot decide which picture I should put on the menu card. I still have the old menu, the one that I am going to give to the customers from the control group. The new version of the menu is going to be presented to the treatment group.

My key metric is the number of customers who buy the dessert. At the same time, I am going to track other metrics, such as the amount of money customers spend and the time they stay in the restaurant. Still, the only metric used to decide whether to change the menu design permanently is how many customers purchased the cake.

Define the success

The biggest mistake I have seen in A/B testing is not a mistake in calculations or making a decision based on more than one metric. Such mistakes sometimes happen, but they are easy to fix when someone notices them.

The biggest problem is usually caused by performing an A/B test without defining the hypothesis and the desired outcome. It is problematic because it allows interpreting the outcome in any way we want.

Let’s assume that I just decided to run an A/B test and did not decide what “success” means. When the test ends, I have two values which I can interpret in any way I want. If I preferred variant A, I would interpret the outcome as a success of variant A.

That is why we must define the test parameters before starting the test. After all, there are only two parameters we care about: significance level and power of the test.

Hypothesis

Every statistical test involves two hypotheses. The first one is called the null hypothesis. It always claims that there is no difference between the treatment and the control group. The second one is called the alternative hypothesis and always claims that there is a difference between the two groups.

That is why there are only two possible outcomes of the test. We can either reject the null hypothesis (which means we accept the alternative hypothesis as the truth) or fail to reject it (in this case we continue to accept the null hypothesis as the truth).

In my example, the null hypothesis is: “The cake photo on the menu card does not influence the number of sold cakes.” The alternative hypothesis is: “More people buy cakes when they see the new photo.”

Test power and underpowered tests

In my “restaurant example” I cannot simply give every guest a different menu card because people come in groups. Every person sitting at the same table must have the same menu card. Because of that, I will need a longer test: my number of samples is not the number of clients, but the number of groups.

How long must the test be?

It depends on the power of the test we want to have. Typically, we choose 80% as the threshold. Statistical power is the probability that a test rejects a false null hypothesis. Why 80%? I have no idea. It looks like some tradition or “best practice.” Is it enough? We are not trying to discover a cure for cancer, so I assume 80% is enough.

The test power parameter matters because it tells us how many samples we need in each group to detect the effect with the desired probability. To calculate the required number of samples, we need to define three parameters.

We already have the statistical power defined (80%). The second parameter is the significance level. We need it to decide whether the null hypothesis should be rejected and also to calculate the number of samples. We are going to use a standard value too (5%).

The last parameter is the effect size. It expresses the difference between the two means as a number of standard deviations. This value is the minimal difference we want the A/B test to detect.

Imagine that we currently measure the number of customer groups who buy desserts every day. We know that the current mean number of dessert-eating customers is 123 groups with a standard deviation = 20. We want it to be mean 140 and std 20. The effect size is Cohen’s d value based on those mean and standard deviation values.

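from math import sqrt

# Current (control) and desired (treatment) daily means, with a shared standard deviation
mean0 = 123
mean1 = 140
std = 20

# Cohen's d: the difference of the means divided by the pooled standard deviation
cohens_d = (mean0 - mean1) / (sqrt((std ** 2 + std ** 2) / 2))
cohens_d

# -0.85 (negative because the smaller control mean comes first; the magnitude is what matters here)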
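
When only raw daily counts are available instead of a known mean and standard deviation, the same value can be estimated from the samples. The snippet below is a minimal sketch of that idea: the function name `cohens_d_from_samples` and the daily counts are made up for illustration, and it uses the pooled standard deviation that also works for groups of unequal size.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d_from_samples(treatment, control):
    """Estimate Cohen's d from two samples using the pooled standard deviation."""
    n1, n0 = len(treatment), len(control)
    # Weight each sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(treatment) + (n0 - 1) * variance(control)) / (n1 + n0 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical daily counts of dessert-buying groups, not real data
control_counts = [118, 131, 109, 127, 140, 113]
treatment_counts = [135, 149, 128, 152, 141, 133]

d = cohens_d_from_samples(treatment_counts, control_counts)
print(round(d, 2))
```

Here the treatment mean comes first, so a positive value means the new menu performs better.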

After calculating the effect size, I can calculate the number of samples I need to detect such a change between both groups.

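from statsmodels.stats.power import TTestIndPower

effect = -0.85  # Cohen's d calculated above
alpha = 0.05    # significance level
power = 0.8     # desired statistical power

# nobs1 is left as None, so solve_power solves for the number of observations in the first group
analysis = TTestIndPower()
analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

# 22.728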

I need at least 23 observations in each group to detect the desired effect. In my example, an observation is the number of groups of customers who buy desserts on a given day. Hence the test is going to last for at least 23 days.
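
As a sanity check, the calculation can be run in the other direction: given 23 observations per group, what power does the test have? The sketch below assumes the same two-sided setting as the calculation above and uses the `power` method of `TTestIndPower`. The smaller effect sizes in the loop (0.5 and 0.2) are hypothetical values added only to show how quickly the required sample size grows when the expected change is smaller.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
alpha = 0.05

# Power achieved with 23 observations per group and an effect size of 0.85
achieved_power = analysis.power(effect_size=0.85, nobs1=23, alpha=alpha, ratio=1.0)
print(f"power with 23 observations per group: {achieved_power:.3f}")

# Required observations per group for the original and two smaller (hypothetical) effect sizes
required = {}
for effect_size in (0.85, 0.5, 0.2):
    required[effect_size] = analysis.solve_power(
        effect_size, power=0.8, nobs1=None, ratio=1.0, alpha=alpha
    )
    print(f"effect size {effect_size}: {required[effect_size]:.1f} observations per group")
```

If the menu change is expected to move the metric only slightly, the test has to run much longer than 23 days.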

Statistical significance and interpreting the results

After gathering the required number of observations, we can use Student's t-test to check whether we can reject the null hypothesis at the chosen significance level.

We set the significance level to 0.05, which means that if there is no real difference between the groups, there is only a 5% chance that the test reports a difference anyway.
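
The meaning of the significance level can be checked with a small simulation. The sketch below is an illustration, not part of the original analysis: it repeatedly draws both groups from the same normal distribution (mean 123, standard deviation 20, 23 observations each, so the null hypothesis is true by construction) and counts how often a two-tailed t-test still reports a significant difference. The seed and the number of simulations are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_simulations = 2000
n_days = 23

false_positives = 0
for _ in range(n_simulations):
    # Both "menus" come from the same distribution: there is no real difference
    group_a = rng.normal(loc=123, scale=20, size=n_days)
    group_b = rng.normal(loc=123, scale=20, size=n_days)
    _, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    if p_value < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_simulations
print(f"share of tests that wrongly rejected the null hypothesis: {false_positive_rate:.3f}")
```

The printed share should land close to the chosen significance level.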

Imagine that I have gathered the results. For the sake of writing a blog post, I generated some random values. If you want to know how to generate random values that fit the desired distribution, look at my Monte Carlo simulation blog post.
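
One possible way to produce such values is sketched below. It assumes normally distributed daily counts with the means and standard deviation used earlier (123 and 140, standard deviation 20) and 23 days per menu. The seed is arbitrary, so these values will not reproduce the exact numbers shown in the rest of this post.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Simulated daily counts of dessert-buying groups for each menu version
old_menu_sales = rng.normal(loc=123, scale=20, size=23)
new_menu_sales = rng.normal(loc=140, scale=20, size=23)

print(old_menu_sales.mean(), new_menu_sales.mean())
```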

Now, we can perform the t-test.

from scipy import stats

# equal_var=False selects Welch's variant of the t-test, which does not assume equal variances
stats.ttest_ind(new_menu_sales, old_menu_sales, equal_var=False)

# Ttest_indResult(statistic=2.664523580261259, pvalue=0.01078649822078583)

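By default, this function performs a two-tailed test. My alternative hypothesis is one-sided (more people buy cakes when they see the new photo), so I have to divide the returned p-value by 2.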

If the halved p-value is smaller than the significance level and the returned statistic is larger than 0, we can reject the null hypothesis.

The values indicate that I can reject the null hypothesis: half of the returned p-value is about 0.0054, which is below 0.05, and the statistic is positive. Because of that, I can conclude that the new picture on the menu helps me sell more cakes ;)
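
In SciPy 1.6 and newer, `ttest_ind` accepts an `alternative` parameter, so the one-sided test can be requested directly instead of halving the p-value by hand. The sketch below uses simulated stand-in data (same assumptions as before: means 123 and 140, standard deviation 20, 23 days, arbitrary seed), so its numbers differ from the result shown above.

```python
import numpy as np
from scipy import stats

# Simulated stand-in data, not the values used in the post
rng = np.random.default_rng(seed=1)
old_menu_sales = rng.normal(loc=123, scale=20, size=23)
new_menu_sales = rng.normal(loc=140, scale=20, size=23)

# Two-tailed test (the default) and the one-sided test "new menu sells more"
two_sided = stats.ttest_ind(new_menu_sales, old_menu_sales, equal_var=False)
one_sided = stats.ttest_ind(
    new_menu_sales, old_menu_sales, equal_var=False, alternative="greater"
)

alpha = 0.05
reject_null = one_sided.pvalue < alpha
print(two_sided.pvalue / 2, one_sided.pvalue, reject_null)
```

When the statistic is positive, the one-sided p-value equals half of the two-tailed one, and the direction check is built into the test.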


Bartosz Mikulski
Bartosz Mikulski * data scientist / software engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group