Probability plot - visually compare probability distributions

A probability plot is a method of comparing probability distributions that humans can easily interpret. It is not some weird number denoted by a Greek letter you can’t even pronounce correctly. A probability plot is just a picture.

Probability plot with SciPy

Let’s draw the probability plot of the age variable from the Titanic dataset. I am going to check whether it fits a normal distribution.

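from scipy.stats import probplot
import seaborn as sea
import matplotlib.pyplot as plt

# Load the Titanic dataset bundled with seaborn
titanic = sea.load_dataset("titanic")
# The age column has missing values; drop them before plotting
probplot(titanic.age.dropna(), dist="norm", plot=plt)
plt.show()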
Probability plot of the age column from the Titanic dataset.

What do we know? The data is clearly not normally distributed. If it were, the probability plot would look like this:

from scipy.stats import norm

# 200 draws from a normal distribution with mean 200 and standard deviation 1
example = norm.rvs(loc=200, size=200)
probplot(example, dist="norm", plot=plt)
plt.show()
A probability plot of a sample that matches the given distribution. In this case, a normal distribution.

What if we want to compare the data to a different kind of probability distribution? Just change the value of the “dist” parameter. The interpretation of the output stays the same.
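Here is a minimal sketch of switching the “dist” parameter. It uses a synthetic exponential sample (an assumption for illustration, not data from this post) and compares it with both "expon" and "norm". When no plot argument is passed, probplot only returns the numbers behind the picture, including the correlation coefficient r of the fitted line.

```python
import numpy as np
from scipy.stats import probplot

# Synthetic sample drawn from an exponential distribution (illustrative only)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=500)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_expon) = probplot(sample, dist="expon")
_, (_, _, r_norm) = probplot(sample, dist="norm")

# The closer r is to 1, the closer the points are to a straight line
print(f"r against expon: {r_expon:.3f}")
print(f"r against norm:  {r_norm:.3f}")
```

To get the picture instead of the numbers, pass plot=plt as in the examples above.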

If the distribution of the given sample fits the given probability distribution, the points follow a straight line (the second plot in this blog post).
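The straight line that probplot draws is a least-squares fit through the points. The sketch below prints its parameters for a synthetic normal sample (the seed and variable names are my own choices for illustration). Because the sample is compared with the standard normal distribution, the intercept should land near the sample mean and the slope near the sample standard deviation.

```python
import numpy as np
from scipy.stats import probplot

# Synthetic normal sample with mean 200 and standard deviation 1 (illustrative only)
rng = np.random.default_rng(42)
sample = rng.normal(loc=200, scale=1, size=200)

# Without the plot argument, probplot returns the data behind the picture:
# theoretical quantiles, ordered sample values, and the fitted line
(theoretical, ordered), (slope, intercept, r) = probplot(sample, dist="norm")

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```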

What does the first plot mean? An arced line (or even S-shaped line) indicates that the given sample distribution is skewed or has a “heavy tail”. Is it true? Let’s look at the age variable histogram:

The histogram of the age variable

As expected, the distribution is skewed.
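The arc can also be seen in numbers. This is a rough sketch that assumes a synthetic right-skewed sample as a stand-in for a skewed column; it does not use the Titanic data. It measures how far the points sit from the fitted line: for a right-skewed sample compared with a normal distribution, both ends of the plot lie above the line and the middle lies below it.

```python
import numpy as np
from scipy.stats import probplot

# Synthetic right-skewed sample (illustrative stand-in for a skewed variable)
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=500)

(theoretical, ordered), (slope, intercept, r) = probplot(skewed, dist="norm")

# Vertical distance of each point from the fitted straight line
residuals = ordered - (slope * theoretical + intercept)

middle = len(residuals) // 2
print("left end above the line: ", residuals[0] > 0)
print("middle below the line:   ", residuals[middle] < 0)
print("right end above the line:", residuals[-1] > 0)
```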

Why is the probability plot useful? You can quickly draw the plot, look at it, and know whether the sample has the expected distribution. You don’t need a statistical test. You don’t need to think about the significance level. You don’t need to think about the null hypothesis and the alternative hypothesis. You don’t need to decide whether you want to use a Shapiro–Wilk test, a Kolmogorov–Smirnov test, or something else. You just make a plot and look at it ;)

Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group