Human bias in A/B testing

When we run an A/B test, there are four possible outcomes: true positive, true negative, false positive, and false negative. This nomenclature causes problems. A reader who is not familiar with A/B testing may conclude that the test “failed” if it did not produce a true positive.
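The four outcomes come from crossing the (unknown) truth with the test's decision. A minimal sketch of that mapping (the function name and boolean arguments are mine, not from the original text):

```python
def classify_outcome(effect_exists: bool, test_rejected_null: bool) -> str:
    """Map the unknown truth and the test's decision to one of the four outcomes."""
    if test_rejected_null:
        # We declared a difference between control and treatment.
        return "true positive" if effect_exists else "false positive"
    # We declared no difference.
    return "false negative" if effect_exists else "true negative"
```

Note that only two of the four cells involve an error, and a true negative sits in the “correct” half of the table.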

The problem with true negatives

Let’s begin by clarifying one thing. Even though we may be biased against true negatives, a true negative is not a wrong result. It means we have succeeded in discovering that the treatment group does not differ from the control group.

We can address that bias by informing everyone that the real purpose of an A/B test is to discover the truth, not to confirm that the new version of the website/product/whatever is better than the old one.

After all, in business, you can only make suggestions. Your clients make a decision. They either like what you do, or they don’t.


The problem with “corporate A/B testing”

Unfortunately, I have seen A/B tests that were performed only to justify a decision that had already been made.

During such fake tests, many metrics were tracked to make sure that at least one of them gave a positive result (and it did not matter whether it was a true positive or a false positive). That metric was then proclaimed to be the most important one, and success was announced.
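The more metrics you track, the more likely it is that at least one of them looks “significant” purely by chance. A quick simulation (the 10-metric count, 5% threshold, and trial count are assumptions chosen for illustration):

```python
import random

random.seed(42)

ALPHA = 0.05          # conventional significance threshold
N_METRICS = 10        # metrics tracked in the fake test
N_EXPERIMENTS = 10_000

# Under the null hypothesis (no real effect anywhere), each metric's
# p-value is uniformly distributed on [0, 1]. Count how often at least
# one of the tracked metrics crosses the threshold anyway.
false_alarms = sum(
    1
    for _ in range(N_EXPERIMENTS)
    if min(random.random() for _ in range(N_METRICS)) < ALPHA
)

rate = false_alarms / N_EXPERIMENTS
```

Analytically, the chance of at least one false positive among 10 independent metrics is 1 - (1 - 0.05)^10 ≈ 40%, so a “win” on some metric is almost guaranteed if you run enough tests this way.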

After all, if someone’s bonus depends on the result of a test, the test is going to pass. For such people, the truth does not matter. They will declare success no matter what, even if their decision destroys the company in the long run.

The only way to prevent this is to explain what an “underpowered” test is and why we should wait until the end of the test before we start looking at the results (or at least before we make a decision based on those results).

We should also make sure that only one metric is used to judge the result. We may track many metrics, but the test should be decided by a single one, and we must choose it before we begin collecting the data.

Obviously, that only works if you can impose such rules without negative consequences. If you work in a place where you may get fired for doing your job correctly, I can’t help you.


Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
