Wilson score in Python - example

The Wilson score interval is a method of estimating the population proportion from a sample proportion when the observations follow the binomial distribution. As a result, we get a range of probabilities for a chosen confidence level.

In this article, I am going to show how to calculate the Wilson score, describe its input variables, and explain how to interpret the result.

Example

Let’s begin with the binomial distribution. It is the distribution of the number of successes when every observation has only two possible outcomes, for example, a coin toss, a visitor who clicked the “like” button or did not, a customer who purchased a product or did not purchase it.

Imagine that I want to know how many people are going to read an article on a website. I know that 989 people clicked the link, and 737 people scrolled to the bottom of the page. We assume that the people who scrolled to the bottom have read the article.

We see that the sample proportion is around 0.745 (74.5% of people who opened the article scrolled to the bottom). We also know that the variable follows the binomial distribution because there are only two possible outcomes: read the article or did not read it.

To calculate the Wilson score we need three things:

  • the desired confidence level of the Wilson score interval, usually 95%

  • the sample size - in my case, 989

  • the sample proportion - 0.745

How to calculate the Wilson score

  1. In the first step, I must look up the z-score value for the desired confidence interval in a z-score table. The z-score for a 95% confidence interval is 1.96.
z = 1.96
  2. Calculate the Wilson denominator (n is the sample size)
denominator = 1 + z**2/n
  3. Calculate the Wilson centre adjusted probability (p is the sample proportion)
centre_adjusted_probability = p + z*z / (2*n)
  4. Calculate the Wilson adjusted standard deviation
adjusted_standard_deviation = sqrt((p*(1 - p) + z*z / (4*n)) / n)
  5. Calculate the Wilson score interval
lower_bound = (centre_adjusted_probability - z*adjusted_standard_deviation) / denominator
upper_bound = (centre_adjusted_probability + z*adjusted_standard_deviation) / denominator
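
Instead of looking the z-score up in a table, it can also be computed. The snippet below is a minimal sketch, assuming Python 3.8 or newer, where the standard library provides statistics.NormalDist. The helper name z_for_confidence is my own and is not part of any library.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z-score for a confidence level such as 0.95 (hypothetical helper)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A two-sided interval leaves (1 - confidence) / 2 of the probability in each tail,
    # so we ask the standard normal distribution for the matching upper quantile.
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


z = z_for_confidence(0.95)
print(round(z, 2))  # 1.96
```

The computed value is slightly more precise than the rounded 1.96 from a table, so the interval bounds may differ in the later decimal places.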

Here is the Python code of the whole function.

from math import sqrt

def wilson(p, n, z = 1.96):
    # p - sample proportion, n - sample size, z - z-score of the confidence level
    denominator = 1 + z**2/n
    centre_adjusted_probability = p + z*z / (2*n)
    adjusted_standard_deviation = sqrt((p*(1 - p) + z*z / (4*n)) / n)

    # the interval is centred on the adjusted probability, not on p itself
    lower_bound = (centre_adjusted_probability - z*adjusted_standard_deviation) / denominator
    upper_bound = (centre_adjusted_probability + z*adjusted_standard_deviation) / denominator
    return (lower_bound, upper_bound)
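
For reference, the steps of the function can be collapsed into one expression. This is a sketch of the usual closed form of the Wilson score interval, written with my own notation: \hat{p} stands for the sample proportion (p in the code), n for the sample size, and z for the z-score.

```latex
\[
\frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
\]
```

The numerator holds the centre adjusted probability plus or minus z times the adjusted standard deviation, and the denominator is the Wilson denominator.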

When I put my example values as the parameters, I get:

positive = 737
total = 989
p = positive / total
(p, wilson(p, total))
# (0.7451971688574317, (0.7171265544922645, 0.7713703014009615))

In this case, the lower bound of the Wilson score is 0.717, and the upper bound is 0.771.

Interpretation

The Wilson score gives me two numbers. Given my sample size and the sample proportion, I can be 95% confident that between 71.7% and 77.1% of visitors are going to read the article. Strictly speaking, the 95% describes the method: if I repeated the measurement many times, about 95% of the intervals calculated this way would contain the true proportion. To get the number of people, I have to multiply the Wilson score bounds by the number of visitors and round the result to an integer. For 989 visitors, I get 709 and 763.
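
The conversion from proportions to a number of people is a single multiplication per bound. Here is a small sketch that reuses the bounds printed in the example and assumes the same number of visitors (989). The variable names lower_count and upper_count are mine.

```python
# Bounds copied from the output of the wilson() example above.
lower_bound, upper_bound = 0.7171265544922645, 0.7713703014009615
total = 989  # assumed number of visitors, same as the sample size

# Scale each proportion to a head count and round to the nearest integer.
lower_count = round(lower_bound * total)
upper_count = round(upper_bound * total)
print(lower_count, upper_count)  # 709 763
```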

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz