Measuring document similarity in machine learning

In this article, I explain two metrics that can be used to measure the difference or similarity of documents, datasets, and anything else that can be represented as a collection of features (boolean features for the first metric, boolean or numeric features for the second). The goal is to build an intuition for using those metrics correctly.

Jaccard index

The Jaccard index measures how much two sets overlap: it is the number of elements that exist in both sets divided by the number of elements that exist in at least one of them. Because it only asks whether an element is present or absent, this method is useful only for comparing boolean features.

When we have a categorical variable, we must encode it as a collection of boolean features. A numeric variable cannot be represented as a finite number of boolean features unless it takes only a small number of distinct values, so we cannot use the Jaccard index with such features.
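To make the encoding step concrete, here is a minimal sketch of turning one categorical value into boolean features. The helper name one_hot and the list of colors are hypothetical and serve only as an illustration.

```python
def one_hot(value, categories):
    """Encode one categorical value as a list of boolean features,
    one feature per known category."""
    return [value == category for category in categories]


# Hypothetical categorical variable with three possible values.
colors = ["red", "green", "blue"]

encoded = one_hot("green", colors)
print(encoded)  # [False, True, False]
```

Every category becomes its own true/false feature, which is the form the Jaccard index can work with.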

(It is also called Intersection over Union. That name may be more familiar to people who draw bounding boxes around objects in images for object detection.)

There is a special case: if both sets are empty, the Jaccard index is, by convention, equal to 1.

In scikit-learn, the Jaccard index is implemented by the sklearn.metrics.jaccard_score function.
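The calculation can be sketched in a few lines of plain Python. The function name jaccard_index and the two example word sets are my own illustration; treating two empty sets as a perfect match follows the convention described above.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty: by convention the index is 1.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical documents represented as sets of words.
doc_a = {"machine", "learning", "document"}
doc_b = {"machine", "learning", "similarity", "metric"}

# 2 shared words out of 5 distinct words in total.
print(jaccard_index(doc_a, doc_b))  # 0.4
```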
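A short usage sketch, assuming scikit-learn is installed. The two vectors are made-up boolean encodings of two documents, where 1 means the feature is present.

```python
from sklearn.metrics import jaccard_score

# Hypothetical boolean feature vectors for two documents.
doc_a = [1, 1, 0, 1, 0]
doc_b = [1, 0, 0, 1, 1]

# Two features are present in both documents and four are present
# in at least one of them, so the score is 2 / 4.
score = jaccard_score(doc_a, doc_b)
print(score)  # 0.5
```

With the default settings, the function treats the inputs as binary labels and counts only the positions where the value is 1.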



Cosine distance

This metric is not limited to boolean values. It deals with both categorical variables (after encoding) and numeric variables. In this method, every feature becomes one coordinate in an n-dimensional space. If I have only two features, I get coordinates in a two-dimensional space. For example, if feature A has the value 3 and feature B has the value 5, the obtained coordinates are (3, 5).

An observation is represented as a vector that points from the origin of the coordinate system to the point determined by the coordinates. When I have the vector representation of every document, I can measure the angle between a pair of vectors.

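When I calculate the cosine of the angle, I get the Cosine similarity. The cosine value is 1 when both vectors point in the same direction, 0 when the vectors are perpendicular, and -1 when they point in opposite directions. When all features are non-negative (for example, word counts), the value never drops below 0.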
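Here is a minimal sketch of that calculation in plain Python: the dot product of the two vectors divided by the product of their lengths. The function name and the example vectors are illustrative, and the sketch assumes neither vector consists only of zeros, because the angle is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Assumes both vectors are non-zero; otherwise this divides by zero.
    return dot / (norm_u * norm_v)


same = cosine_similarity((3, 5), (6, 10))       # same direction, close to 1
orthogonal = cosine_similarity((1, 0), (0, 1))  # perpendicular, exactly 0
opposite = cosine_similarity((3, 5), (-3, -5))  # opposite direction, close to -1

print(same, orthogonal, opposite)
```

Note that (3, 5) and (6, 10) count as identical here: the metric looks only at the direction of the vectors, not at their length.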

To obtain the Cosine distance from Cosine similarity, we have to subtract the Cosine similarity from 1.
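As an illustration, the sketch below computes the Cosine distance between two tiny documents. Representing each document as a vector of word counts over a fixed vocabulary is my assumption for the example; the helper names, the vocabulary, and the sentences are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norms


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["cat", "dog", "sat"]
doc_a = count_vector("cat sat cat", vocabulary)  # [2, 0, 1]
doc_b = count_vector("dog sat", vocabulary)      # [0, 1, 1]

distance = cosine_distance(doc_a, doc_b)
print(round(distance, 3))  # 0.684
```

A distance close to 0 means the documents use the same words in similar proportions. With non-negative counts like these, the distance stays between 0 and 1.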




Bartosz Mikulski
Bartosz Mikulski * big data engineer * conference speaker * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group