Measuring document similarity in machine learning
In this article, I explain two metrics that measure the difference or similarity of documents, datasets, and anything else that can be represented as a collection of boolean values. The goal is to build an intuition for using those metrics correctly.
Jaccard index
The Jaccard index measures how much two sets overlap: the number of elements present in both sets relative to the number of elements present in at least one of them. Because it relies on set membership, this method is useful only for comparing boolean features.
When we have a categorical variable, we must encode it as a collection of boolean features. Numeric variables cannot be represented as a finite number of boolean features unless the number of distinct values is small enough, so we cannot use the Jaccard index for such features.
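The encoding step can be sketched in a few lines of plain Python. This is only an illustration: the `one_hot` helper and the `colors` list are hypothetical names made up for this example, not part of any library.

```python
def one_hot(value, categories):
    """Encode one categorical value as a list of boolean features,
    one feature per known category."""
    return [value == category for category in categories]


# Hypothetical categorical variable with three possible values.
colors = ["red", "green", "blue"]

encoded = one_hot("green", colors)
print(encoded)  # [False, True, False]
```

Each category becomes its own boolean feature, so a variable with three possible values turns into three features, of which exactly one is true.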
(It is also called Intersection over Union. That name may be more familiar to people who work on object detection, where it measures how well a predicted bounding box overlaps the true one.)
\[J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}\]
There is one special case: if both sets are empty, the Jaccard index is defined as 1.
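To make the formula concrete, here is a minimal sketch that computes the index with Python sets. The function name `jaccard_index` and the two example word sets are my own choices for illustration. The empty-set case follows the convention stated above.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention: two empty sets are considered identical.
        return 1.0
    intersection = len(a & b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = len(a) + len(b) - intersection
    return intersection / union


# Hypothetical documents, each reduced to the set of words it contains.
doc_a = {"machine", "learning", "metrics"}
doc_b = {"machine", "learning", "similarity", "distance"}

print(jaccard_index(doc_a, doc_b))  # 2 shared words / 5 distinct words = 0.4
```

Two documents with no words in common get a score of 0, and two documents with exactly the same words get a score of 1.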
In scikit-learn, the Jaccard index is implemented by the sklearn.metrics.jaccard_score function.
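A short usage sketch, assuming scikit-learn is installed and that each document has already been encoded as a list of 0/1 features in the same order. The two feature lists are made-up data. The function was designed for comparing true and predicted labels, so its first two arguments are named `y_true` and `y_pred`; here they are simply the two documents.

```python
from sklearn.metrics import jaccard_score

# Hypothetical documents encoded as boolean features (1 = feature present).
doc_a = [1, 1, 0, 0, 1]
doc_b = [1, 0, 0, 1, 1]

# Features 0 and 4 are present in both documents (intersection = 2);
# features 0, 1, 3 and 4 are present in at least one (union = 4).
score = jaccard_score(doc_a, doc_b)
print(score)  # 2 / 4 = 0.5
```

When neither input contains a positive value, the result depends on the function's `zero_division` setting, so it is worth checking the documentation for your scikit-learn version before relying on the "both sets empty" convention.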
Cosine distance
This metric is not limited to boolean values. It handles both categorical variables (after encoding) and numeric variables. In this method, every feature becomes one coordinate in an n-dimensional space. If I have only two features, I get coordinates in a two-dimensional space. For example, if feature A has the value 3 and feature B has the value 5, the resulting coordinates are (3, 5).
An observation is represented as a vector that points from the origin of the coordinate system to the point determined by those coordinates. When I have the vector representation of every document, I can measure the angle between any pair of vectors.
The cosine of that angle is the Cosine similarity. It is 1 when both vectors point in the same direction, 0 when they are perpendicular, and -1 when they point in opposite directions. When all feature values are non-negative, as with word counts, the similarity stays between 0 and 1.
To obtain the Cosine distance from Cosine similarity, we have to subtract the Cosine similarity from 1.
\[\text{CosineDistance}(A,B) = 1 - \text{CosineSimilarity}(A,B)\]
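Both quantities can be sketched with the standard library alone. The function names and the example vectors are illustrative choices. The similarity is computed as the dot product divided by the product of the vector lengths, which is the usual way to get the cosine of the angle between two vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so the angle is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1 - cosine_similarity(a, b)


same = cosine_similarity([3, 5], [6, 10])        # same direction: close to 1
orthogonal = cosine_similarity([3, 0], [0, 5])   # perpendicular: 0
opposite = cosine_similarity([3, 5], [-3, -5])   # opposite direction: close to -1

print(same, orthogonal, opposite)
print(cosine_distance([3, 5], [6, 10]))          # close to 0
```

The vectors (3, 5) and (6, 10) differ in length but not in direction, so their distance is approximately 0. Cosine distance ignores magnitude, which is why it works well for comparing documents of different lengths.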
Bartosz Mikulski