Wilson score in Python - example
The Wilson score is a method of estimating the population proportion from a sample proportion when the outcome follows the binomial distribution. The result is a range of probabilities (a confidence interval) at a chosen confidence level. In this article, I am going to show how to calculate the Wilson score, describe its input variables, and explain how to interpret the result.
Let’s begin with the binomial distribution. It is the distribution of observations when there are only two possible outcomes, for example, a coin toss, clicking the “like” button or not, or purchasing a product or not. In my example, I want to know how many people are going to read an article on a website. I know that 989 people clicked the link, and 737 people scrolled to the bottom of the page (so I count them as the people who read the article). The sample proportion is therefore 737 / 989, or around 0.745 (74.5% of the people who opened the article scrolled to the bottom). We also know that the variable follows the binomial distribution because there are only two possible outcomes (read the article or did not read it).
To calculate the Wilson score, we need three things:
- the desired confidence level of the Wilson score interval, usually 95%
- the sample size - in my case, 989
- the sample proportion - 0.745
How to calculate the Wilson score
- In the first step, I must look up the z-score value for the desired confidence interval in a z-score table. The z-score for a 95% confidence interval is 1.96.
    z = 1.96
- Calculate the Wilson denominator
    denominator = 1 + z**2/n
- Calculate the Wilson centre adjusted probability
    centre_adjusted_probability = p + z*z / (2*n)
- Calculate the Wilson adjusted standard deviation
    adjusted_standard_deviation = sqrt((p*(1 - p) + z*z / (4*n)) / n)
- Calculate the Wilson score interval
    lower_bound = (centre_adjusted_probability - z*adjusted_standard_deviation) / denominator
    upper_bound = (centre_adjusted_probability + z*adjusted_standard_deviation) / denominator
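Put together, the steps above correspond to the usual closed form of the Wilson score interval. Here is a sketch of it in LaTeX. The symbols are my own notation, not something the code depends on: \hat{p} is the sample proportion (p in the code), n is the sample size, and z is the z-score.

```latex
\[
\frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\,\sqrt{\dfrac{\hat{p}\,(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}}
\]
```

The first term of the numerator is centre_adjusted_probability, the square root is adjusted_standard_deviation, and the bottom of the fraction is denominator.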
Here is the Python code of the whole function.
    from math import sqrt

    def wilson(p, n, z=1.96):
        denominator = 1 + z**2/n
        centre_adjusted_probability = p + z*z / (2*n)
        adjusted_standard_deviation = sqrt((p*(1 - p) + z*z / (4*n)) / n)

        lower_bound = (centre_adjusted_probability - z*adjusted_standard_deviation) / denominator
        upper_bound = (centre_adjusted_probability + z*adjusted_standard_deviation) / denominator
        return (lower_bound, upper_bound)
When I put my example values as the parameters, I get:
    positive = 737
    total = 989
    p = positive / total
    (p, wilson(p, total))
    # (0.7451971688574317, (0.7171265544922645, 0.7713703014009615))
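The function takes z as a parameter, so it also works for confidence levels other than 95%. Instead of looking the value up in a table, it can probably be computed with the standard library. The sketch below assumes Python 3.8 or newer (for statistics.NormalDist). The helper name z_for_confidence is my own invention for this illustration, and the wilson function is repeated only to keep the example self-contained.

```python
from math import sqrt
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z-score for a confidence level such as 0.95 (hypothetical helper)."""
    # Split the remaining probability equally between the two tails.
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def wilson(p, n, z=1.96):
    # Same calculation as the function shown in the article.
    denominator = 1 + z**2 / n
    centre_adjusted_probability = p + z * z / (2 * n)
    adjusted_standard_deviation = sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    lower_bound = (centre_adjusted_probability - z * adjusted_standard_deviation) / denominator
    upper_bound = (centre_adjusted_probability + z * adjusted_standard_deviation) / denominator
    return (lower_bound, upper_bound)


z95 = z_for_confidence(0.95)
z99 = z_for_confidence(0.99)
print(round(z95, 2))  # 1.96
print(round(z99, 3))  # 2.576

# A higher confidence level gives a wider interval for the same data.
print(wilson(737 / 989, 989, z99))
```

A higher confidence level means a larger z, and therefore a wider interval around the same sample proportion.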
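I wanted a 95% confidence level, so the Wilson score gives me two numbers: the lower and the upper bound of an interval that, given my sample size and sample proportion, is expected to contain the population proportion. Strictly speaking, the 95% describes the method rather than this one interval: if I repeated the sampling many times, about 95% of the intervals calculated this way would contain the true population proportion.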
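One way to get a feeling for what the 95% means is to simulate it. The sketch below is only an illustration under assumptions of my own: I pretend that the true population proportion is 0.745, draw many samples of 989 visitors, and count how often the Wilson interval contains that value. The seed and the number of trials are arbitrary choices.

```python
import random
from math import sqrt


def wilson(p, n, z=1.96):
    # Same calculation as the function shown in the article.
    denominator = 1 + z**2 / n
    centre_adjusted_probability = p + z * z / (2 * n)
    adjusted_standard_deviation = sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    lower_bound = (centre_adjusted_probability - z * adjusted_standard_deviation) / denominator
    upper_bound = (centre_adjusted_probability + z * adjusted_standard_deviation) / denominator
    return (lower_bound, upper_bound)


random.seed(0)   # fixed seed so the run is repeatable
true_p = 0.745   # assumed "true" population proportion, for illustration only
n = 989          # sample size from the article
trials = 2000    # arbitrary number of simulated samples

covered = 0
for _ in range(trials):
    # Simulate n visitors; each one reads the article with probability true_p.
    positive = sum(random.random() < true_p for _ in range(n))
    lower, upper = wilson(positive / n, n)
    if lower <= true_p <= upper:
        covered += 1

coverage = covered / trials
print(coverage)  # the share of intervals that contain true_p; should be close to 0.95
```

The exact number depends on the seed and the number of trials, but it should stay near 0.95.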
I wanted to know how many people are going to read the article. To express the interval as counts, multiply the Wilson bounds by the sample size and round the results to integers: out of 989 people who open the article, I can expect at least 709 and no more than 763 to read it, at the 95% confidence level.
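The conversion from proportions to counts is a one-liner per bound. Here is a minimal sketch that reuses the bounds printed earlier in the article. The variable names lower_count and upper_count are mine.

```python
# Wilson bounds copied from the result shown earlier in the article.
lower_bound, upper_bound = 0.7171265544922645, 0.7713703014009615
total = 989  # sample size

# Scale each proportion by the sample size and round to a whole number of people.
lower_count = round(lower_bound * total)
upper_count = round(upper_bound * total)

print(lower_count, upper_count)  # 709 763
```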