Which index should you use while building an application with LlamaIndex?

LlamaIndex helps us build applications that use both LLMs and our private data. It provides the data access layer for AI. Besides multiple storage implementations, LlamaIndex also offers multiple indexing methods. Our choice affects the application’s performance and the cost of using the language model. Let’s compare several index implementations and see what happens when we use them.

What is an index?

Jerry Liu wrote “an Index manages the state: abstracting away underlying storage, and exposing a view over processed data & associated metadata.” LlamaIndex indexes nodes. Nodes are chunks of the documents we load from storage.

To retain the most control over the creation of the nodes, we can load the data ourselves and create documents.

# creating documents the manual way
from llama_index import Document

chunks_of_text = [...]
documents = [Document(t) for t in chunks_of_text]

We can decide whether a LlamaIndex Document is an entire document from our business domain, a chapter, a paragraph, or whatever makes sense. However, even when we use such a fine-grained approach, LlamaIndex may convert a single document to multiple nodes if it is too long.

If we have to split a document into multiple nodes, the llama_index.data_structs.node.DocumentRelationship class helps us keep track of the relationships between the nodes.
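Here is a minimal sketch of linking two manually created nodes that came from the same long document. It assumes the Node class from the same llama_index.data_structs.node module; the texts and ids are placeholders.

# a sketch: two chunks of one long source document, wired together manually
from llama_index.data_structs.node import Node, DocumentRelationship

node1 = Node(text='...first chunk...', doc_id='chunk-1')
node2 = Node(text='...second chunk...', doc_id='chunk-2')

# record that the chunks follow each other in the original document
node1.relationships[DocumentRelationship.NEXT] = node2.get_doc_id()
node2.relationships[DocumentRelationship.PREVIOUS] = node1.get_doc_id()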

Usually, we won’t create the nodes manually, though. Instead, we let the index prepare the nodes. Automated parsing and indexing are preferred because the from_documents function also creates the relevant metadata and the relationships between the nodes, and we will later use the same index to retrieve the data.
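Under the hood, from_documents delegates the chunking to a node parser. If we ever need the nodes without building an index, we can call the parser directly; a sketch, assuming the default SimpleNodeParser:

# split the documents into nodes the same way the indexes do by default
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)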

What is the difference between indexes?

If we let the library parse the documents, we should at least know the available options and what happens when we choose them. To test it, I print the debugging information after every operation, so we can see what happens under the hood.

First, I load the documents from Markdown files in the data directory:

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()

For the test, I put three files there, so we end up with 15 LlamaIndex Documents in the documents list.

Now, we configure the LLM implementation and the debugging handler:

from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index import ServiceContext, LLMPredictor
from langchain.chat_models import ChatOpenAI

llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0))

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager, llm_predictor=llm_predictor)

Finally, we can start creating indexes and observing the differences between them.

In all of the examples, I will use the as_query_engine() function, which creates a query engine backed by the index’s default Retriever. Instead, I could create a retriever manually and configure it by specifying, for example, the number of documents to retrieve or metadata filters, as shown in the sketch below. However, I want to focus on the differences between the indexes, so I will stick to the default values.
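For reference, here is roughly what the manual setup could look like once we have an index (similarity_top_k is an example parameter of the vector index retriever, not a value I recommend):

# build the retriever by hand instead of relying on as_query_engine defaults
from llama_index.query_engine import RetrieverQueryEngine

retriever = index.as_retriever(similarity_top_k=5)
query_engine = RetrieverQueryEngine.from_args(retriever, service_context=service_context)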

Also, the indexing implementations mentioned in this article aren’t the only available options. For example, keyword-based indexing can be implemented by GPTRAKEKeywordTableIndex, which uses the Rapid Automatic Keyword Extraction algorithm. When you select the general kind of index, look at the available implementations to find the most appropriate one.

When should you use GPTVectorStoreIndex?

First, we create a GPTVectorStoreIndex. GPTVectorStoreIndex turns the text into numerical vectors using text embeddings and retrieves relevant documents based on the similarity of those vectors.

When we index the documents, the library chunks them into 15 nodes and calls the embeddings endpoint of the OpenAI API.

from llama_index import GPTVectorStoreIndex
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
Trace: index_construction
    |_node_parsing ->  0.027615 seconds
      |_chunking ->  0.000429 seconds
      |_chunking ->  0.001618 seconds
      |_chunking ->  0.001659 seconds
      |_chunking ->  0.004154 seconds
      |_chunking ->  0.001352 seconds
      |_chunking ->  0.000939 seconds
      |_chunking ->  0.000948 seconds
      |_chunking ->  0.001772 seconds
      |_chunking ->  0.000962 seconds
      |_chunking ->  6.6e-05 seconds
      |_chunking ->  0.001523 seconds
      |_chunking ->  0.001792 seconds
      |_chunking ->  0.005623 seconds
      |_chunking ->  0.002196 seconds
      |_chunking ->  0.000866 seconds
    |_embedding ->  0.270817 seconds
    |_embedding ->  0.187781 seconds

The number of API calls during indexing depends on the amount of data. However, GPTVectorStoreIndex uses only the embeddings API, which is the cheapest API provided by OpenAI.

Now, when we ask LlamaIndex to answer a question, it will create a vector from the question, retrieve relevant data, and pass the text to the LLM. The LLM will generate the answer using our question and the retrieved documents:

response = query_engine.query("...question...")
Trace: query
    |_query ->  5.698652 seconds
      |_retrieve ->  0.137124 seconds
        |_embedding ->  0.12821 seconds
      |_synthesize ->  5.561296 seconds
        |_llm ->  5.551707 seconds

Using GPTVectorStoreIndex, we can implement the most popular method of passing private data to LLMs: create embedding vectors and find the relevant documents based on the similarity between the documents and the question.

The GPTVectorStoreIndex implementation has an obvious advantage. It is cheap to index and retrieve the data. We can also reuse the index to answer multiple questions without sending the documents to LLM many times. The disadvantage is that the quality of the answers depends on the quality of the embeddings. If the embeddings are not good enough, the LLM will not be able to generate a good response. For example, I would recommend skipping the table of contents when you index the documents.
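To reuse the index across runs without paying for the embeddings again, we can persist it and load it back later. A rough sketch, assuming a llama_index version that exposes the StorageContext API:

from llama_index import StorageContext, load_index_from_storage

# save the vectors and metadata to disk right after indexing
index.storage_context.persist(persist_dir='./storage')

# later (for example, in another process), rebuild the index without re-embedding
storage_context = StorageContext.from_defaults(persist_dir='./storage')
index = load_index_from_storage(storage_context, service_context=service_context)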

When should you use GPTListIndex?

The GPTListIndex index is perfect when you don’t have many documents. Instead of trying to find the relevant data, the index concatenates all chunks and sends them all to the LLM. If the resulting text is too long, the index splits the text and asks the LLM to refine the answer.

from llama_index.indices.list import GPTListIndex
index = GPTListIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

When we index the data, nothing really happens. The documents get split into nodes, but LlamaIndex doesn’t call the OpenAI API.

Trace: index_construction
    |_node_parsing ->  0.028038 seconds
      |_chunking ->  0.000233 seconds
      |_chunking ->  0.001512 seconds
      |_chunking ->  0.001511 seconds
      |_chunking ->  0.006192 seconds
      |_chunking ->  0.001856 seconds
      |_chunking ->  0.001432 seconds
      |_chunking ->  0.002345 seconds
      |_chunking ->  0.001805 seconds
      |_chunking ->  0.000964 seconds
      |_chunking ->  5.9e-05 seconds
      |_chunking ->  0.001576 seconds
      |_chunking ->  0.001805 seconds
      |_chunking ->  0.001571 seconds
      |_chunking ->  0.002446 seconds
      |_chunking ->  0.001002 seconds

All the work happens when we send the question:

response = query_engine.query("...question...")
Trace: query
    |_query ->  11.953841 seconds
      |_retrieve ->  0.033188 seconds
      |_synthesize ->  11.920409 seconds
        |_llm ->  11.83114 seconds

GPTListIndex may be a good choice when we have a few questions to answer using a handful of documents. It may give us the best answer because the LLM gets all the available data, but it is also quite expensive. We pay per token, so sending all the documents to the LLM may not be the best idea.
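If we do use GPTListIndex with longer documents, one parameter worth experimenting with is the response mode. For example, the standard tree_summarize mode builds the answer hierarchically instead of refining it sequentially chunk by chunk (a sketch; the question is a placeholder):

# summarize the chunks hierarchically instead of refining the answer one chunk at a time
query_engine = index.as_query_engine(response_mode='tree_summarize')
response = query_engine.query('...question...')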

When should you use GPTKeywordTableIndex?

The GPTKeywordTableIndex implementation extracts keywords from the indexed nodes and uses them to find relevant documents. When we ask a question, the implementation first generates keywords from the question. Next, the index searches for the relevant documents and sends them to the LLM.

from llama_index.indices.keyword_table import GPTKeywordTableIndex
index = GPTKeywordTableIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

The bulk of the work happens at indexing time. Every node is sent to the LLM to generate keywords. Of course, sending every document to an LLM skyrockets the cost of indexing, not only because we pay for the tokens but also because calls to OpenAI’s Completion API take longer than calls to the Embeddings API.

Trace: index_construction
    |_node_parsing ->  0.040063 seconds
      |_chunking ->  0.00028 seconds
      |_chunking ->  0.001599 seconds
      |_chunking ->  0.004401 seconds
      |_chunking ->  0.003369 seconds
      |_chunking ->  0.003774 seconds
      |_chunking ->  0.002919 seconds
      |_chunking ->  0.002929 seconds
      |_chunking ->  0.003231 seconds
      |_chunking ->  0.001713 seconds
      |_chunking ->  8.7e-05 seconds
      |_chunking ->  0.002394 seconds
      |_chunking ->  0.003203 seconds
      |_chunking ->  0.002635 seconds
      |_chunking ->  0.003253 seconds
      |_chunking ->  0.001448 seconds
    |_llm ->  1.042544 seconds
    |_llm ->  1.867183 seconds
    |_llm ->  2.661816 seconds
    |_llm ->  2.410908 seconds
    |_llm ->  1.998625 seconds
    |_llm ->  2.11114 seconds
    |_llm ->  2.260326 seconds
    |_llm ->  2.301919 seconds
    |_llm ->  2.335496 seconds
    |_llm ->  0.785303 seconds
    |_llm ->  2.75616 seconds
    |_llm ->  5.845675 seconds
    |_llm ->  2.305362 seconds
    |_llm ->  2.460298 seconds
    |_llm ->  2.065906 seconds

Interestingly, we can retrieve the content of the keyword index and see what keywords were generated for each node. This means we could use LlamaIndex even when we need only the keywords and aren’t interested in using the library for question answering.

index.index_struct
KeywordTable(index_id='9c35e0f8-9b45-41e2-ad89-dade51c754d6', summary=None, table={'lessons': {'4fd66a36-e8b8-4f86-b6d3-a3e94b545304'}, 'learned': {'4fd66a36-e8b8-4f86-b6d3-a3e94b545304'}, 'meeting': {'bf5b2775-759d-4cb2-9fd7-f9fca238dc3e', '05ebc2c1-7793-47d6-9cd8-48d4ba015c81', '2b0cddba-7425-4941-8f02-91f69cd26c86', '6ef26fd0-cd37-4b56-9561-fbbae2580d83'}, 'meetup': {'2b0cddba-7425-4941-8f02-91f69cd26c86', '5dd14ff8-bd55-471e-b03c-c052beef5fc7', '8fb47eb4-6475-43e3-b2a7-1c33ce5df94f'}, 'meetings': {'2b0cddba-7425-4941-8f02-91f69cd26c86', '8fb47eb4-6475-43e3-b2a7-1c33ce5df94f'}, 'scala': {'e2adaeb8-e495-4cdb-ae3f-8713a444f040', '2b0cddba-7425-4941-8f02-91f69cd26c86'}...})

When we send the question, the implementation generates the keywords, filters the nodes, and uses the LLM to generate the response:

response = query_engine.query("...question...")
Trace: query
    |_query ->  8.29206 seconds
      |_retrieve ->  1.10838 seconds
        |_llm ->  1.106967 seconds
      |_synthesize ->  7.183447 seconds
        |_llm ->  7.169445 seconds

Because the implementation sends every node to the LLM, it is difficult for me to find a use case where I would prefer it. It’s slower and more expensive than the indexes we have seen so far. However, if I already had the keywords extracted, I assume I could override the _insert method to avoid sending the nodes to the LLM. But that’s a niche situation.
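A rough sketch of that idea follows. Instead of touching _insert directly, it overrides the keyword-extraction hook that both indexing and insertion go through; it assumes the index exposes an overridable _extract_keywords method (GPTSimpleKeywordTableIndex overrides the same hook with a regex-based extractor), and keywords_by_text is a hypothetical lookup table of pre-extracted keywords.

from llama_index.indices.keyword_table import GPTKeywordTableIndex

class PrecomputedKeywordTableIndex(GPTKeywordTableIndex):
    def __init__(self, *args, keywords_by_text=None, **kwargs):
        # store the lookup table before the parent constructor starts indexing the nodes
        self._keywords_by_text = keywords_by_text or {}
        super().__init__(*args, **kwargs)

    def _extract_keywords(self, text):
        # return the pre-extracted keywords instead of asking the LLM
        return set(self._keywords_by_text.get(text, []))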

When should you use GPTKnowledgeGraphIndex?

The previous index was expensive and slow, but building a knowledge graph requires even more resources. GPTKnowledgeGraphIndex builds a knowledge graph with keywords and relations between them. Alternatively, we could use embeddings by specifying the retriever_mode parameter (KGRetrieverMode.EMBEDDING), but, as I mentioned earlier, I use the default values. The default behavior of GPTKnowledgeGraphIndex is based on keywords.

from llama_index.indices.knowledge_graph import GPTKnowledgeGraphIndex
index = GPTKnowledgeGraphIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
Trace: index_construction
    |_node_parsing ->  0.023567 seconds
      |_chunking ->  0.000311 seconds
      |_chunking ->  0.001542 seconds
      |_chunking ->  0.00141 seconds
      |_chunking ->  0.004781 seconds
      |_chunking ->  0.001988 seconds
      |_chunking ->  0.000882 seconds
      |_chunking ->  0.000852 seconds
      |_chunking ->  0.001645 seconds
      |_chunking ->  0.00091 seconds
      |_chunking ->  5.6e-05 seconds
      |_chunking ->  0.00145 seconds
      |_chunking ->  0.001765 seconds
      |_chunking ->  0.001515 seconds
      |_chunking ->  0.002026 seconds
      |_chunking ->  0.000818 seconds
    |_llm ->  5.933608 seconds
    |_llm ->  3.438587 seconds
    |_llm ->  3.846163 seconds
    |_llm ->  7.778094 seconds
    |_llm ->  9.76739 seconds
    |_llm ->  5.346881 seconds
    |_llm ->  5.278296 seconds
    |_llm ->  10.466592 seconds
    |_llm ->  4.773259 seconds
    |_llm ->  7.595503 seconds
    |_llm ->  6.565589 seconds
    |_llm ->  6.697721 seconds
    |_llm ->  7.449141 seconds
    |_llm ->  9.516032 seconds
    |_llm ->  4.958723 seconds

The retrieval part is similar to the keyword-based index. First, we get the keywords from the question. After that, we search for the relevant nodes using the knowledge graph and pass the documents to the LLM to generate the answer.

response = query_engine.query("...question...")
Trace: query
    |_query ->  4.339767 seconds
      |_retrieve ->  1.191037 seconds
        |_llm ->  1.176236 seconds
      |_synthesize ->  3.148483 seconds
        |_llm ->  3.070309 seconds

However, the GPTKnowledgeGraphIndex comes with an additional benefit. We can retrieve the knowledge graph, and we can even visualize it. The networkx library is required to run the code below!

import networkx as nx
import matplotlib.pyplot as plt

# draw the knowledge graph extracted by the index (requires networkx and matplotlib)
fig = plt.figure(1, figsize=(100, 40), dpi=30)
nx.draw_networkx(index.get_networkx_graph(), font_size=18)
[Figure: the knowledge graph generated from the indexed documents]

As the knowledge graph-based indexing method preserves relations between nodes, it may give us better results than keywords alone. However, building the index takes a lot of time, even when you have only three Markdown documents in the directory. Still, having the keywords extracted and the documents clustered using the knowledge graph may be useful in some cases.

