How to measure the similarity of sequence values

To measure the similarity of data sequences, we can reuse methods designed to measure the similarity of strings. In this blog post, I show two metrics that quantify the difference between ordered sequences of values. They are usually applied to texts or rankings, but they work for other kinds of sequences too.

Levenshtein distance

In the case of text, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

The same method can measure the distance between any pair of sequences if we treat each element as a "character", for example, the pages visited by a user during a single session or the products purchased during a user's lifetime.
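As a minimal sketch of that idea, the function below computes the Levenshtein distance with the classic dynamic-programming approach. It accepts any two sequences whose elements can be compared with ==, so it works for strings and for lists of page names alike. The function name levenshtein and the sample sessions are made up for illustration.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        # Turning a[:i] into an empty sequence takes i deletions.
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution_cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,                      # delete item_a
                current[j - 1] + 1,                   # insert item_b
                previous[j - 1] + substitution_cost,  # substitute (or keep) the element
            ))
        previous = current
    return previous[-1]


# Works for text...
print(levenshtein("kitten", "sitting"))  # 3

# ...and for any other sequence, such as hypothetical page visits in two sessions.
session_a = ["home", "search", "product", "cart"]
session_b = ["home", "product", "cart", "checkout"]
print(levenshtein(session_a, session_b))  # 2
```

The sketch keeps only two rows of the distance table in memory, so it needs space proportional to the length of the second sequence rather than to the product of both lengths.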

Kendall tau distance

Kendall tau distance is a metric of the difference between two rankings. It counts the pairs of items that appear in the opposite order in the two rankings, which equals the number of adjacent swaps bubble sort would make to put one sequence in the same order as the other.

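Similarly, we can use the metric to compare any two sequences, as long as both contain the same items and no item repeats. SciPy does not return this distance directly. Its scipy.stats.kendalltau function computes the closely related Kendall tau correlation coefficient, a value between -1 and 1, and expects the rank of each item as input rather than two ordered lists. When there are no ties, the two are linked: for n items and distance d, tau = 1 - 4d / (n(n - 1)).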
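The distance itself is easy to compute by hand. The following is a minimal sketch in plain Python, assuming both sequences contain the same items with no repeats. It counts the pairs that are ordered differently in the two sequences. The function name kendall_tau_distance and the sample rankings are made up for illustration.

```python
from itertools import combinations
from typing import Hashable, Sequence


def kendall_tau_distance(first: Sequence[Hashable], second: Sequence[Hashable]) -> int:
    """Number of item pairs that appear in opposite order in the two sequences."""
    # The metric is defined for two orderings of the same set of distinct items.
    if len(first) != len(second) or len(set(first)) != len(first) or set(first) != set(second):
        raise ValueError("both sequences must contain the same items without repeats")

    # Map every item to its position in the second sequence.
    position_in_second = {item: index for index, item in enumerate(second)}
    # Rewrite the first sequence as positions in the second one.
    positions = [position_in_second[item] for item in first]
    # Every inversion (a larger position before a smaller one) is one discordant pair.
    return sum(1 for earlier, later in combinations(positions, 2) if earlier > later)


ranking_a = ["A", "B", "C", "D"]
ranking_b = ["B", "A", "D", "C"]
print(kendall_tau_distance(ranking_a, ranking_b))  # 2
```

Checking every pair takes quadratic time, which is fine for short sequences. For long ones, counting inversions with a merge sort would bring the cost down to O(n log n).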

Bartosz Mikulski

  • MLOps engineer by day
  • AI and data engineering consultant by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
  • Mastodon: @mikulskibartosz@mathstodon.xyz