Human bias in A/B testing

When we perform an A/B test, there are four possible outcomes: true positive, true negative, false positive, and false negative. This nomenclature itself causes some issues. If the person reading the result is not familiar with A/B testing, they may conclude that the test “failed” if it did not give a true positive result.
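
To make those four outcomes concrete, here is a minimal simulation sketch (my own example, not taken from any real test). The conversion rates, sample sizes, and function name are assumptions made up for illustration; it uses proportions_ztest from statsmodels and classifies the outcome by comparing what the test says with what is actually true.

```python
# A minimal sketch of the four possible A/B test outcomes.
# The 5%/6% conversion rates and the sample size are arbitrary assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
alpha = 0.05  # significance level


def run_test(p_control, p_treatment, n=5000):
    """Simulate one A/B test and classify its outcome."""
    control = rng.binomial(n, p_control)      # conversions in the control group
    treatment = rng.binomial(n, p_treatment)  # conversions in the treatment group
    _, p_value = proportions_ztest([control, treatment], [n, n])

    significant = p_value < alpha
    really_different = p_control != p_treatment

    if significant and really_different:
        return "true positive"
    if significant and not really_different:
        return "false positive"
    if not significant and really_different:
        return "false negative"
    return "true negative"


print(run_test(0.05, 0.05))  # identical groups -> ideally a true negative
print(run_test(0.05, 0.06))  # treatment is better -> ideally a true positive
```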

The problem with true negatives

Let’s begin by clarifying one thing. Even though we may be biased against a true negative, it is not a wrong result. It means we have succeeded in discovering that the control group does not differ from the treatment group.

We can solve that issue by reminding everyone that the real purpose of an A/B test is to discover the truth, not to confirm that the new version of the website/product/whatever is better than the old one.

After all, in business, you can only make suggestions. Your clients make a decision. They either like what you do, or they don’t.

The problem with “corporate A/B testing”

Unfortunately, I have seen A/B tests that were performed only to justify a decision that had already been made.

During such fake tests, many metrics were tracked to make sure that at least one of them gave a positive result (it did not matter whether it was a true positive or a false positive). That one metric was then proclaimed to be the most important one, and success was announced.
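
A quick back-of-the-envelope calculation shows why tracking many metrics almost guarantees a “winner”. This is my own sketch, and it assumes the metrics are independent and each is tested at a 5% significance level:

```python
# Chance that at least one of k independent metrics looks "significant"
# purely by luck, when each metric is tested at alpha = 0.05.
for k in (1, 5, 10, 20):
    p_at_least_one = 1 - 0.95 ** k
    print(f"{k:>2} metrics -> {p_at_least_one:.0%} chance of a false positive")
# 20 metrics -> roughly 64%, so finding "at least one winner" is almost guaranteed.
```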

After all, if someone’s bonuses depend on the result of a test, the test is going to pass. For such people, the truth does not matter. They will declare success, no matter what, even if their decision destroys the company in the long run.

The only way to prevent that is to explain what an “underpowered” test is and why we should wait until the end of the test before we start looking at the results (or at least before we make a decision based on them).
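
To illustrate why peeking before the end of a test is dangerous, here is a small simulation I wrote for this point (a sketch, with made-up numbers, not a recipe). It runs an A/A test, so both groups share the same 5% conversion rate, checks the p-value every 1000 visitors, and stops at the first “significant” result; every early stop is therefore a false positive.

```python
# Peeking simulation: both groups are identical, so every early "win" is a
# false positive. Sample sizes and the number of peeks are arbitrary assumptions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
alpha = 0.05
n_total = 10_000                          # visitors per group at the end of the test
peeks = range(1000, n_total + 1, 1000)    # look at the data every 1000 visitors


def peeking_test():
    control = rng.binomial(1, 0.05, n_total)
    treatment = rng.binomial(1, 0.05, n_total)
    for n in peeks:
        _, p = proportions_ztest(
            [control[:n].sum(), treatment[:n].sum()], [n, n]
        )
        if p < alpha:
            return True                   # stopped the test early on a false positive
    return False


simulations = 2000
false_positives = sum(peeking_test() for _ in range(simulations))
print(f"False positive rate with peeking: {false_positives / simulations:.1%}")
# Typically well above the nominal 5%, even though nothing changed between the groups.
```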

We should also make sure that there is only one metric used to compare the results. We may track many of them, but the test should use only one, and we must decide which one it is before we begin collecting the data.
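
One way to enforce that is to write the decision rule down before the test starts. The sketch below is purely hypothetical: all the names, metrics, and numbers are made up, and it only shows the idea of declaring a single primary metric (plus monitoring-only metrics) up front.

```python
# Hypothetical "pre-registration" of an experiment: the primary metric is fixed
# in code before any data is collected; other metrics are tracked only for monitoring.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExperimentPlan:
    name: str
    primary_metric: str                  # the only metric that decides the test
    tracked_metrics: tuple = field(default_factory=tuple)  # monitored, not decisive
    alpha: float = 0.05
    minimum_sample_size: int = 10_000    # from a power calculation done up front


plan = ExperimentPlan(
    name="new-checkout-flow",
    primary_metric="conversion_rate",
    tracked_metrics=("average_order_value", "bounce_rate"),
)
print(plan)
```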

Obviously, that only works if you can impose such rules without negative consequences. If you work in a place where you may get fired for doing your job correctly, I can’t help you.


