Measuring document similarity in machine learning

In this article, I am going to explain two metrics that can be used to measure difference/similarity of documents, datasets, and everything else that can be represented as a collection of boolean values. The goal is to build an intuition about using those metrics correctly.

Jaccard index

We use the Jaccard index to measure how many elements exist in both sets, because of that, this method is useful only to compare boolean features.

When we have a categorical variable, we must encode them as a collection of boolean features. Numeric variables cannot be represented as a finite number of boolean features unless the cardinality of their values is small enough, so we cannot use the Jaccard index in the case of such features.

(It is also called Intersection over Union. That name may be more familiar to people who deal with putting bounding boxes around images during image classification.)

\[J(A,B) = { {|A \cap B|}\over{|A \cup B|} } = { {|A \cap B|}\over{|A| + |B| - |A \cap B|} }\]

There is a special case, if both sets are empty, the Jaccard index is equal 1.

In Scikit-learn the Jaccard index is implemented by sklearn.metrics.jaccard_score function.

Cosine distance

This metric is not limited to boolean values. It deals with both categorical variables (after encoding) and numeric variables. In this method, every feature becomes one coordinate in an n-dimensional space. If I have only two features, I get coordinates in the two-dimensional space. For example, if I have feature A which has value 3 and feature B equal 5, the obtained coordinates are (3, 5).

An observation is represented as a vector that points from the middle of the coordinate system to the point determined by the coordinates. When I have the vector representation of every document, I can measure the angle between a pair of vectors.

When I calculate the cosine of the angle, I get the Cosine similarity. The cosine value is 1 when both vectors point in the same direction and 0 when vectors point in opposite directions.

To obtain the Cosine distance from Cosine similarity, we have to subtract the Cosine similarity from 1.

\[CosineDistance(A,B) = 1 - CosineSimilarity(A,B)\]

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.