Detection of Text Duplicates and Text Search with Word Embeddings and Vector Databases

Searching through a vast amount of data can be a challenging task, especially when it comes to detecting duplicates of text. Search methods based on text comparison can be time-consuming and produce irrelevant results. Enter word embeddings and vector databases.

Word embeddings are numerical representations of words in a vector space, capturing the semantic meaning of words. Vector databases store word embeddings, allowing for fast and efficient searching through large amounts of text data. When we use word embeddings and vector databases for text search and duplicate detection, the results can be more accurate and relevant than old-school methods.

In this article, I will explain the concept of word embeddings and vector databases, how to use them for text search and duplicate detection, and the benefits of using this approach. I will show you how to create embeddings of a given text using OpenAI's text-embedding-ada-002 model, store the vectors in the Milvus vector database, and search through them.

What are Word Embeddings?

Word embeddings are numerical representations of words that preserve the semantic meaning of words. Words with similar meanings should be close to each other in the vector space. Also, if we generate word embeddings for two similar texts, the distance between the vectors (usually measured with the cosine distance, i.e., one minus the cosine similarity) should be small.

Word embeddings are a key component in many natural language processing tasks, such as text classification, sentiment analysis, and machine translation. They capture the meaning of words in a way that older methods, such as bag-of-words or one-hot encoding, cannot. Furthermore, because texts are represented as numbers in the vector space, we can use the distance calculation for text search and duplicate detection. The vectors generated from texts containing a given search phrase will be closer to the vector of the search phrase than other vectors. Similarly, the distance between the vectors of two similar texts will be almost zero.
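
As an illustration, here is a minimal sketch of the distance calculation with NumPy. The two three-dimensional vectors are made up for the example; real embeddings have hundreds or thousands of dimensions.

import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    a, b = np.asarray(a), np.asarray(b)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance([0.1, 0.9, 0.2], [0.1, 0.8, 0.3]))  # small value -> similar texts
print(cosine_distance([0.1, 0.9, 0.2], [0.9, 0.1, 0.7]))  # larger value -> different texts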

What are Vector Databases?

Vector databases store word embeddings, allowing for fast and efficient searching through large amounts of text data. They are a NoSQL database type, meaning they are not relational and do not use SQL queries. Instead, they use vector similarity search algorithms to find similar vectors in the database.

Vector databases use the Approximate nearest neighbor (ANN) algorithm to find similar vectors in the database. In Milvus, the ANN algorithm works only on indexed fields containing vectors. To index the field, we must specify an index type and the similarity metric used to calculate the distance between vectors.

Milvus supports several index types and similarity metrics. Look at the linked documentation to determine which is the best in your case.

How to Create Word Embeddings?

There are many ways to create word embeddings. In this article, I will show you how to use the OpenAI model to generate word embeddings. First, we need to install the OpenAI library.

pip install openai

Then, we need to set the API key. You can get your API key from the OpenAI user account page.

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")

The text-embedding-ada-002 model accepts up to 8191 input tokens and outputs a vector with 1536 dimensions, so we will get a Python list of 1536 numbers representing the embedding of the given text.
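
If you want to check that a text fits within that limit before calling the API, you can count tokens with the tiktoken library (a separate pip install tiktoken; a minimal sketch):

import tiktoken

encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
number_of_tokens = len(encoding.encode("To be, or not to be, that is the question"))
print(number_of_tokens)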

We use the get_embedding function from the openai.embeddings_utils module to get the embeddings.

from openai.embeddings_utils import get_embedding

embedding = get_embedding(
    line,  # the text we want to embed
    engine="text-embedding-ada-002"
)

If you don’t want to pay for OpenAI, you can replace it with word2vec (for example, the TensorFlow implementation) or SentenceTransformers. Both libraries are free and open-source.
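
For instance, here is a minimal SentenceTransformers sketch using the all-MiniLM-L6-v2 model. Note that it produces 384-dimensional vectors, so the dim parameter in the Milvus schema shown later would have to change accordingly.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("To be, or not to be, that is the question").tolist()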

Creating Embeddings for the Example Dataset

I will use the tiny Shakespeare dataset, which contains 40,000 lines of text from Shakespeare’s plays. I won’t use all of the lines. I don’t need so much data to demonstrate the concept, and I don’t want to pay too much for the API calls.

with open('data.txt', 'r') as file:
    data = file.readlines()
data = data[:1000]

In addition to the vectors, I want to store metadata for each line. The Milvus database doesn’t support storing string fields of any length (yet). I could use the varchar field type, but I don’t want to specify the maximum length of the string. Therefore, I will use the int64 field type to store the index of the line in the original dataset. Additionally, we need a primary key. In this case, the primary key will also be the line index.

primary_keys = []
line_numbers = []
vectors = []
for i, line in enumerate(data):
    # generate an embedding for every line and keep the line index as metadata
    embedding = get_embedding(
        line,
        engine="text-embedding-ada-002"
    )
    primary_keys.append(i)
    line_numbers.append(i)
    vectors.append(embedding)

How to Store Word Embeddings in Milvus?

Before we start, we must connect to the database and create a collection of vectors:

from pymilvus import (
    connections,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
)

connections.connect("default", host="localhost", port="19530")

As I mentioned before, the collection of vectors may contain optional metadata. In my example, I store the line number. In a real-world application, we could store a UUID of the text, language code, publication date, etc.

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="line_number", dtype=DataType.INT64),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields, "text lines with metadata")
lines_collection = Collection("lines", schema)
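
If you rerun the script, the lines collection will already exist. The utility module imported above lets you check for that and drop the stale collection before recreating it; a short sketch to run before the Collection call:

if utility.has_collection("lines"):
    utility.drop_collection("lines")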

Inserting Data into Milvus

We already have the data in the suitable format, so inserting it into the database doesn’t require additional steps. We can use the insert method to insert the data into the database.

insert_summary = lines_collection.insert([primary_keys, line_numbers, vectors])

The insert_summary object contains information about the inserted data, such as the number of rows inserted or the number of errors:

(insert count: 1000, delete count: 0, upsert count: 0, timestamp: 439432913687150595, success count: 1000, err count: 0)
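
Milvus buffers inserted data in memory. If you want to make sure the vectors are sealed and the entity count is up to date, you can flush the collection (a sketch; exact behavior depends on your pymilvus version):

lines_collection.flush()
print(lines_collection.num_entities)  # 1000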

Creating an Index

We can already search for vectors, but unless we create an index, Milvus will use brute-force search, comparing the query vector with every vector in the collection. Brute-force search is slow and should be used only for small datasets. To create an index, we use the create_index method.

We must specify the index type and the similarity metric to define an index. I will use the IVF_FLAT index type and the L2 similarity metric in this example. We also need the nlist parameter, which defines the number of clusters in the index. The Milvus documentation offers hints on how to choose the best value for the nlist parameter.

Generally, increasing nlist leads to more buckets and fewer vectors in a bucket during clustering. As a result, the computation load decreases, and search performance improves. However, with fewer vectors for similarity comparison, the correct result might be missed.

How to Select Index Parameters for IVF Index

index_params = {
    "metric_type":"L2",
    "index_type":"IVF_FLAT",
    "params":{"nlist":32}
}

lines_collection.create_index(
    field_name="embeddings",
    index_params=index_params
)

How to Search for Similar Texts?

Before we start, we have to load the collection into Milvus memory. Otherwise, we will get an error.

lines_collection.load()

To search for similar texts, we need to specify the query vector and the number of results we want to get. The search method returns a list of results. Each result contains the primary key of the vector and the distance between the query vector and the result vector. If we want to retrieve other fields, we must specify them in the output_fields parameter.

Every search operation requires search parameters such as the metric type and the nprobe parameter. Look at the article linked below to learn more about setting nprobe properly.

When searching using indexes, the first step is to find a certain number of buckets closest to the target vector and the second step is to find the most similar k vectors from the buckets by vector distance. nprobe is the number of buckets in step one.

How to Select Index Parameters for IVF Index

search_params = {"metric_type": "L2", "params": {"nprobe": 10}, "offset": 0}

array_of_vectors = [
    get_embedding("Search text", engine="text-embedding-ada-002")
]

results = lines_collection.search(
    data=array_of_vectors, # the query vectors
    anns_field="embeddings", # the field with the vectors
    param=search_params,
    limit=10, # the number of results
    output_fields=['line_number'], # additional fields to return
    expr=None, # the expression to filter the results after the search
    consistency_level="Strong"
)

The results variable contains a list of result sets. Each result set corresponds to one element of the array_of_vectors parameter. To get the primary keys of the results, we can use the ids attribute:

results[0].ids

An example result:

[8, 14, 45, 5, 11, 21, 82, 24, 89, 2]

To get the line_number field (or any other metadata field), we must use the get method:

results_of_first_search_vector = results[0]
for result in results_of_first_search_vector:
    line_number = result.entity.get('line_number')
    ...
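
Because the stored metadata is the line index, we can map every hit back to the original text. A minimal sketch that prints the distance and the matching line from the data list loaded earlier:

for result in results[0]:
    line_number = result.entity.get('line_number')
    print(result.distance, data[line_number].strip())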

How to Run Duplicate Detection with Word Embeddings?

We can determine if a text is a duplicate of another text by comparing the distance between their vectors. If the distance is small, the texts are similar. If the distance is zero, the texts are identical. If the distance is large, the texts are different.

Because the database can return metadata, you will not only know that a text is a duplicate but also which original text it duplicates.

Finding Exact Duplicates

Let’s start with exact duplicates.

In Shakespeare’s play, lines uttered by Menenius begin with MENENIUS:. We can find them by creating a query vector from the text MENENIUS:\n and searching for similar texts.

exact_embeddings = get_embedding(
    'MENENIUS:\n',
    engine="text-embedding-ada-002"
)

When we have the vector, we can run a search:

results = lines_collection.search(
    data=[exact_embeddings],
    anns_field="embeddings",
    param=search_params,
    output_fields=['line_number'],
    limit=3,
    expr=None,
    consistency_level="Strong"
)

As expected, the distance between the query vector and the result vectors is zero:

results[0].distances
[0.0, 0.0, 0.0]

Finding Similar Texts

Now, we will search for similar texts. In the play, we have a line, “No more talking on’t; let it be done: away, away!” We will look for a similar line.

almost_duplicate = "No more talking on it; let it be done: away, away!"

search_query_embeddings = get_embedding(
    almost_duplicate,
    engine="text-embedding-ada-002"
)

When we pass the query vector to the search method, we get the following results:

results = lines_collection.search(
    data=[search_query_embeddings],
    anns_field="embeddings",
    param=search_params,
    limit=1,
    expr=None,
    consistency_level="Strong"
)

results[0].distances
[0.034118060022592545]

The distance is not zero, but it is small. The texts are similar.

Distance Threshold for Duplicate Detection

We must decide on the distance threshold. When the distance is below the threshold, we can assume that the texts are duplicates.

What is the right threshold? I don’t know. It depends on the use case. To determine the threshold, we need a validation dataset with known duplicates and known non-duplicates. We can then run the duplicate detection algorithm on the validation dataset and determine the threshold based on the results.
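
Once you pick a threshold, applying it is a simple filter over the search results. Here is a sketch with an arbitrary example value of 0.05 (an illustration, not a recommendation):

DUPLICATE_THRESHOLD = 0.05  # example value; tune it on your validation dataset

results = lines_collection.search(
    data=[search_query_embeddings],
    anns_field="embeddings",
    param=search_params,
    limit=5,
    output_fields=['line_number'],
    expr=None,
    consistency_level="Strong"
)

duplicate_line_numbers = [
    hit.entity.get('line_number')
    for hit in results[0]
    if hit.distance < DUPLICATE_THRESHOLD
]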


Do you need help building a text search or duplicate detection service for your business?
You can hire me!
