Enterprise Bot - Blog

Choose the best embedding model for your retrieval-augmented generation (RAG) system

Written by Enterprise Bot | Aug 15, 2024 3:46:25 PM

Retrieval-augmented generation (RAG) systems augment an LLM's inherent knowledge with external data such as company knowledge bases, up-to-date web pages, and other data sources not included in the training process for that LLM.

A step in this augmentation process is to turn the raw data into vectors using an embedding model. Most LLM providers like OpenAI offer one or more embedding models, but embedding models are available from many other sources, too.

This guide explains how embeddings work and why you need them, dispels some common misconceptions about embeddings, and provides the information you need to choose the embedding model that meets your needs.

Why do we need retrieval-augmented generation?

You've likely experimented with the conversational interfaces of modern large language models (LLMs) by now and discovered that they are impressive, yet on their own quite unusable in an enterprise product. While an LLM is capable of seamlessly answering university exam-level questions interactively and consolidating information more effectively than an internet search, it can also fabricate information and generate inaccurate or misleading content.

This erratic behavior is to be expected. LLMs are trained on enormous data sets. They develop the capability to interact using language and a form of general knowledge they draw on for generating text. But there is no guarantee that they will be able to provide any particular piece of information, especially about a topic that isn't general knowledge (like the details of your company).

This limitation is a problem for business use cases. If you're considering using an LLM in an internal knowledge management tool or as a chatbot, search engine, or other feature on your customer-facing product, you need the LLM to be able to provide the correct information specific to your company, and for the information to be up to date. 

So how do you merge the powerful conversational capabilities and general knowledge of a generic LLM with information relevant to your company? That is where retrieval-augmented generation comes in.

A retrieval-augmented generation system enhances the requests made to an LLM with information retrieved from your company's store of knowledge, providing the best of both worlds: the power of sophisticated, cutting-edge language models combined with your company-specific information.

Myth-busting some common misconceptions about LLMs and embeddings

Before we help you choose the correct embedding model, let’s dispel some common misconceptions.

You don't need to fine-tune an LLM to get custom results

When considering customizing the output of a generative AI solution, it's common to assume that it's necessary to train a custom model or fine-tune an existing LLM. But neither of these processes is essential to get good quality results for your product or business.

Training or fine-tuning an LLM is an expensive undertaking that requires significant expertise and resources. Fortunately, you aren't likely to need either for your use case. Combining an off-the-shelf LLM with custom data via a retrieval-augmented generation solution will get you the results you need, and this architecture is easier to maintain. Adding more knowledge to your system is as simple as updating the embeddings rather than fine-tuning the LLM again.

You don’t need to use the embedding model offered by your LLM provider

Retrieval-augmented generation solutions use embeddings to find chunks of relevant information and pass this information (usually as normal text) to the LLM along with a user query.

This means you can use one LLM for the conversational interface that users interact with, and an entirely different embedding model for the retrieval component of your retrieval-augmented generation system. 

You can use a cloud-service LLM for the conversational piece and an open-source embedding model hosted locally for the retrieval piece, or vice versa. The conversational component and the information retrieval component are independent.

And to go a step further, you could even implement the retrieval component without embeddings, using older natural language processing techniques. You probably want to use modern embedding methods, but you don't have to.
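
For example, a purely lexical retriever can be built with a classic ranking function like BM25. Below is a minimal sketch, assuming the open-source rank_bm25 package; the documents and query are hypothetical placeholders for your own knowledge base.

```python
# A minimal lexical-retrieval sketch using BM25 (no embeddings involved).
# Assumes `pip install rank_bm25`; the documents below are hypothetical examples.
from rank_bm25 import BM25Okapi

documents = [
    "We stock hats in sizes S, M, L, and XL.",
    "Our returns policy allows refunds within 30 days.",
    "Shipping is free for orders over $50.",
]

# BM25 operates on tokenized text; a simple lowercase whitespace split is enough for a demo.
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "what hat sizes do you stock".split()
# Return the single best-matching document for the query.
print(bm25.get_top_n(query_tokens, documents, n=1))
```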

Cloud embeddings are surprisingly inexpensive (and fast)

Cloud embedding models typically charge per token. If you have a large knowledge base of hundreds of thousands of documents, you might assume that it’s never going to be economically viable to pass your entire knowledge base through an embedding model.

But the current price for OpenAI’s text-embedding-3-small model is $0.02 per million tokens. Let's see what this means in practical terms.

Tokens are a bit hard to reason about because a single word can be more than one token, but as a rule of thumb, you can estimate 750 words to be about 1,000 tokens.

Say your company's knowledge base is made up of 100,000 documents of approximately 750 words each, around 100 million tokens in total. Embedding the entire knowledge base would cost $2.

OpenAI's text-embedding-3-large, a more advanced model that produces larger embeddings, costs $0.13 per million tokens, meaning it would cost you $13 to embed the same knowledge base.
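
To make the arithmetic concrete, here is a small back-of-the-envelope calculation in Python. The prices and the 750-words-per-1,000-tokens rule of thumb are the figures quoted above; the knowledge-base size is the hypothetical 100,000-document example.

```python
# Back-of-the-envelope embedding cost estimate for the example knowledge base above.
NUM_DOCS = 100_000
WORDS_PER_DOC = 750
TOKENS_PER_WORD = 1000 / 750  # rule of thumb: roughly 750 words per 1,000 tokens

total_tokens = NUM_DOCS * WORDS_PER_DOC * TOKENS_PER_WORD  # about 100 million tokens

# Prices per million tokens, as quoted above (check current pricing before relying on them).
price_per_million = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}

for model, price in price_per_million.items():
    cost = total_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f}")  # small: $2.00, large: $13.00
```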

In terms of speed, you can get the embeddings for a single document in a matter of milliseconds. Using text-embedding-3-small, we embedded 100 documents of about 3,500 words each in 42 seconds. And because embeddings for each document are independent, you can speed the process up by requesting embeddings in parallel.
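
As a rough illustration, here is a sketch of batched, parallel embedding requests using the OpenAI Python SDK. The document list, batch size, and worker count are illustrative assumptions; in a real system you would substitute your own chunks and tune these values.

```python
# Sketch: embedding document chunks in parallel batches with the OpenAI SDK.
# Assumes the OPENAI_API_KEY environment variable is set; `documents` is placeholder content.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()
documents = [f"Document {i} text..." for i in range(1_000)]  # hypothetical chunks
BATCH_SIZE = 100

def embed_batch(batch):
    # A single request can embed a whole list of inputs at once.
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    return [item.embedding for item in response.data]

batches = [documents[i:i + BATCH_SIZE] for i in range(0, len(documents), BATCH_SIZE)]

# Because each document's embedding is independent, batches can be processed concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    embeddings = [vector for result in pool.map(embed_batch, batches) for vector in result]

print(len(embeddings))  # one embedding per document
```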

How does retrieval-augmented generation work?

At a high level, a retrieval-augmented generation system takes an input question or search string and searches your company's knowledge base for the best matching content to answer the question. The matched information is then sent, along with the original question, in a prompt to an LLM. The LLM uses the original question together with the information supplied by the retrieval-augmented generation system to answer the original question in a coherent way. 

A retrieval-augmented generation example

Consider an e-commerce chatbot handling customer queries. A customer asks, "What hat sizes do you stock?" When the question is received, the retrieval-augmented generation system searches the company's knowledge base for text that best matches the phrase, "what hat sizes do you stock?" The retrieval-augmented generation system then combines the text of the customer's request with the retrieved hat-relevant information in a prompt to an LLM, which generates a conversational response to the customer's question.
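
To make the flow concrete, here is a simplified sketch of how the retrieved context and the customer's question might be combined into a single prompt for the LLM. The retrieve function is a hypothetical stub standing in for your retrieval component, and the model name is just an example.

```python
# Sketch: combining retrieved knowledge with the user's question in a prompt to an LLM.
# `retrieve` is a hypothetical stand-in for the retrieval component described above.
from openai import OpenAI

client = OpenAI()

def retrieve(query: str) -> str:
    # In a real system this would search your vector store; here it returns a canned snippet.
    return "We stock hats in sizes S, M, L, and XL."

question = "What hat sizes do you stock?"
context = retrieve(question)

prompt = (
    "Answer the customer's question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; any conversational LLM works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```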

Conversational tasks are where LLMs excel. But what about the prior step, finding the appropriate information in your knowledge base to provide the LLM with the content it needs to answer the question? This is an information retrieval problem.

A simple information retrieval solution might match content by words or tags. This is lexical search, the approach used by early search algorithms. But it's easy to anticipate cases where this sort of search breaks down, for example, when the exact phrasing of a question (or spelling of a word) doesn't match the phrasing of the content in your knowledge base. 

Semantic search is an approach that solves the problem of exact keyword matching by matching content based on the meaning (semantics) of the text. Instead of matching on words, semantic search matches on embeddings.

Embeddings are vectors (lists of numbers) that capture the meaning of a piece of text. A semantic search converts the input text into an embedding (a process called vectorization) and compares that embedding to the vector embeddings of the content being searched. Content with the most similar vector is returned.
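
In practice, "the most similar vector" usually means the one with the highest cosine similarity to the query embedding. Here is a minimal sketch with NumPy; the tiny three-dimensional vectors are made up for illustration, whereas real embeddings typically have hundreds or thousands of dimensions.

```python
# Sketch: ranking content by cosine similarity between embeddings.
# The 3-dimensional vectors below are invented for illustration only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.array([0.9, 0.1, 0.2])
content_embeddings = {
    "hat sizing guide": np.array([0.8, 0.2, 0.1]),
    "returns policy":   np.array([0.1, 0.9, 0.3]),
    "shipping costs":   np.array([0.2, 0.3, 0.9]),
}

# The content whose embedding is closest in meaning to the query is returned.
best_match = max(content_embeddings, key=lambda k: cosine_similarity(query_embedding, content_embeddings[k]))
print(best_match)  # -> "hat sizing guide"
```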

How do embeddings work?

Turning a paragraph of text into a list of numbers that represents the meaning of the content seems almost magical, but this is what LLMs are trained to do: build numerical representations of text in which vectors of texts with similar meanings are mathematically close together.

Modern LLMs like GPT have embedding models at their core, having learned numerical representations of text. But simpler, dedicated embedding models also exist that don't have all the extra generative AI conversational capabilities.

Every embedding model (including an LLM) has its own unique way of representing text as vectors due to its distinct model architecture and specific training data. This means that the numerical vectors created by different models from the same text can be entirely different.

As a result, embeddings only make sense within their own context. You can’t compare embeddings created by one model to those created by another, so if you change your embedding model, you need to re-vectorize your entire knowledge base with the new model.

At Enterprise Bot, we use local embedding models, so your knowledge base is kept private and not shared with external providers.

Understanding the R (retrieval) in retrieval-augmented generation

What does information retrieval look like in practice? 

Firstly, your knowledge base is broken down into chunks of content, and those chunks are vectorized into embeddings. 

Embeddings are stored alongside the original content, often in a purpose-built vector database, so that they can be searched against later. This bulk vectorizing of your knowledge store can be done offline and periodically updated in batches.

With the vector store in place, your retrieval-augmented generation system can now use the same model in real time to transform user queries into vectors and search the vector database to retrieve the most relevant context. The LLM can then use that knowledge to augment its generation.
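
As a rough end-to-end sketch of these steps, here is what bulk vectorization and real-time query matching might look like with a locally hosted open-source embedding model. The model name, the toy chunks, and the in-memory "vector store" are illustrative assumptions; a production system would typically use a purpose-built vector database.

```python
# Sketch: offline bulk embedding of knowledge chunks plus real-time query retrieval.
# Assumes `pip install sentence-transformers numpy`; all names and data are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example small open-source model

# 1. Offline: break the knowledge base into chunks and embed them in bulk.
chunks = [
    "We stock hats in sizes S, M, L, and XL.",
    "Refunds are available within 30 days of purchase.",
    "Standard shipping takes 3-5 business days.",
]
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

# 2. Online: embed the user query with the SAME model and find the closest chunk.
query = "What hat sizes do you stock?"
query_embedding = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity.
scores = chunk_embeddings @ query_embedding
top_chunk = chunks[int(np.argmax(scores))]
print(top_chunk)  # the context that would be passed to the LLM
```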

The primary advantage of using a retrieval-augmented generation system is that the retrieval component is self-contained. In other words, the conversational portion of the solution can be decoupled from the knowledge retrieval portion. We can use different models for each.

The knowledge retrieval component of the system must use the same embedding model to encode your knowledge base and to encode the question that is being matched against your knowledge system. But the conversational component of the system that interacts with a user (in a chatbot use case) or consolidates information into a readable answer to a question (in a more general information retrieval use case) can be an entirely different model. This is possible because the conversational model doesn't need to interact with the embeddings in the retrieval component of your retrieval-augmented generation system.

This is good news for a number of reasons:

  • Privacy: You likely don't want to send your full knowledge base to a cloud-service LLM, and you don't have to. The embedding model must process your entire knowledge base, but you can host it locally and send only the small chunks of retrieved knowledge relevant to each user query to the cloud LLM. Being able to separate the retrieval component from the conversational component allows you to choose an embedding solution that meets your privacy needs, ideally without restricting your choice of LLM.
  • Cost: You have the flexibility to choose the most cost-effective model for the knowledge retrieval embeddings. Although cloud embedding models are surprisingly cheap, they charge per token, and costs can still get out of hand at a very large scale.
  • Upgrades: Decoupling the conversational model from the embedding model gives you the flexibility to change, expand, and upgrade each model in a way that is appropriate to its function. You may want to upgrade to the latest conversational model to take advantage of new capabilities, but have no need to upgrade the information retrieval model, or vice versa.

How do I choose the right embedding model for my retrieval-augmented generation system?

Your options when selecting an embedding model are:

  • Self-host an open-source model.
  • Use a cloud-based provider of open-source models.
  • Use a proprietary cloud-based solution, such as OpenAI.
  • Use an integrated end-to-end solution provider that includes embeddings, like Enterprise Bot.

Embeddings can vary in quality. One way to assess the quality of embedding models is to consult the Hugging Face Massive Text Embedding Benchmark (MTEB) Public Leaderboard. Creators of embedding models can apply a set of standardized benchmark tests to their models, and submit the results to the leaderboard. 

The tabs on the leaderboard table to look at are "Overall" and "Retrieval". Some metrics that will help you choose a model are:

  • "Average" and "Retrieval Average" - Showing models' average performance score against the benchmarks. A model's "Retrieval Average" is important when choosing an embedding model for a retrieval-augmented generation system. View individual models' performances across the retrieval benchmark datasets on the "Retrieval" tab of the leaderboard.
  • "Model Size" and "Memory Usage" - "Model Size" refers to the number of parameters that were trained to build the model. "Memory Usage" refers to how much memory is required to run the model. 
  • "Embedding Dimensions" - The length of the vector embeddings used in the model. Longer embeddings can lead to more accurate models, but also result in a larger vectorized knowledge base.
  • "Max tokens" - Provides an estimate of the amount of text that can be represented by a single embedding. 

It may seem that the best model to choose will simply be the model at the top of the leaderboard at any given time, but this isn't the case. While retrieving the most relevant context for a given user query is vital for good results, small differences in benchmark scores usually matter less than practical considerations like cost, latency, infrastructure, and privacy, which we discuss below.

Self-host or use a cloud service for embeddings?

Using a cloud service usually means accessing an embedding model via an SDK or an API. Self-hosting could mean hosting an open-source model on public cloud infrastructure that you manage, or it could mean using your own on-prem hardware.

Some questions to consider when making this decision:

How cutting-edge do you need your models to be?

Many cutting-edge embedding models are proprietary and provided as cloud services. So if you want to use the latest models, a cloud-based option may be your best choice. But for a retrieval embedding model in a retrieval-augmented generation system, having the latest model may not be necessary (and could even be a disadvantage compared to something that is better tested and better understood).

What is the expertise and capacity of your engineering team?

Whether or not self-hosting is a viable option for you will depend on the capacity of your engineering team. Working with embedding models requires specialist engineering skills, including:

AI and ML engineering

Even though open-source models are available off the shelf, you’ll still need AI and ML experts to ensure that your model is set up correctly and performing as intended. You’ll probably want to run standardized benchmarks against your own embedding solution to ensure that it is performing on par with cloud options.

Infrastructure and DevOps engineering

Hosting models yourself will require in-house DevOps and infrastructure engineering, and possibly hardware expertise if you run your own AI hardware. You’ll want these models to be always available and have low latency (because each time a user asks your chatbot a question, they will need to wait for their question to be vectorized by your embedding solution).

Infrastructure and DevOps teams will need to maintain and update your embedding system, including implementing monitoring and observability metrics to ensure that it performs reliably over time.

How tightly do you need to control the costs?

Both cloud-service and self-hosted options come at a cost. In the case of cloud services, the cost is usually attached directly to the requests you make. For a self-hosted model, the cost lies in the infrastructure needed to host the model and in your engineering team's time.

A benefit of self-hosting is that you are more in control of these costs. You can choose the size of your infrastructure and you are not subject to cloud costs changing over time. On the other hand, if your company is still experimenting with using AI and you don't yet know what you want to develop, cloud services let you get started without needing to build a full system in-house.

How important is latency?

If your use case has very low latency requirements, you may need to choose a self-hosted option, as you cannot control the latency of cloud service providers. Remember that you call an embedding model in two places: to vectorize your knowledge store, and in real time to vectorize the questions or search strings that are sent to the retrieval-augmented generation system for comparison against the vector store. The latency of the knowledge store vectorization usually isn't very important, but the latency of the real-time vectorization of input queries can be. And remember, you need to use the same model for the bulk vectorization of your knowledge base and the online vectorization of each query.

How important is privacy?

If privacy is particularly important to your company, perhaps because of the regulatory environment you work in, you may have to use self-hosted models.

Privacy requirements can be assessed separately for the retrieval and conversational components of your retrieval-augmented generation system. Even if you use a proprietary cloud-service LLM for the conversational component, you may still want to keep more control over the embedding component because it operates over your full knowledge base, which might accidentally include IP, trade secrets, or personally identifiable information (PII) about employees or customers.

If your entire system needs to run on-premise, without any user data being sent to the cloud, then self-hosting (or buying an off-the-shelf solution like Enterprise Bot) is your only choice.

A common solution to the self-hosting versus cloud-service question is to start your initial exploration using a cloud service, where the costs might be relatively higher but you have lots of flexibility. Then when you have figured out how you will use your retrieval-augmented generation system, you can take the step of moving to a self-hosted model. But you will have to consider your specific situation and stage to decide what is best for your company. 

What languages do you need to support?

If you’re using the leaderboard to choose an embedding model, you’ll see it shows benchmarks for a variety of languages. If you need a model for one of the languages being tracked in the Retrieval leaderboard (currently English, Chinese, French, and Polish), you can look for models there. If you need a truly multilingual model, search “multilingual” in the Leaderboard search bar to see the multilingual models, some of which support 100+ languages.

Cloud models like those offered by OpenAI claim to support multiple languages, but OpenAI does not provide a lot of information about how the models were trained, so you might need to verify how well they work in your specific case.

How much infrastructure can you reasonably support?

The "Model Size" and "Memory Usage" columns in the benchmark table are important for practical implementation reasons. Larger models need greater infrastructure resources to run, and therefore come at a higher cost. This is true for self-hosted as well as cloud-service models, although cloud-service models don't usually report their size on the leaderboard. Larger models also have longer inference times, which will increase the latency of the real-time component of your retrieval-augmented generation system.

If you have decided to self-host, an important consideration is whether the model you choose can run on consumer hardware or whether you will need enterprise-level hardware. At the time of writing, leaderboard model sizes range from very small (a few hundred MB) to over 100 GB.

Note that the largest model will not necessarily provide the best results. A high-performing small model may well be good enough for your retrieval-augmented generation embeddings. For example, OpenAI's small embedding model, text-embedding-3-small, has an average MTEB score of 62.3. While this score is lower than the 64.6 achieved by the text-embedding-3-large model, it is considerably more cost-efficient. The text-embedding-3-small model costs $0.02 per million tokens, or $0.01 per million tokens in batch mode (batch mode is suitable for generating a vector store slowly in bulk). The cost of text-embedding-3-large is roughly six and a half times this: $0.13 per million tokens, or $0.07 per million tokens in batch mode.

Understanding embedding model quality metrics

A model's overall performance against the retrieval benchmarks can be found in the "Retrieval Average" column of the MTEB leaderboard. However, a word of caution here: the MTEB scores are self-reported and can sometimes be biased by subtleties such as evaluation data being used to train the models. The scores also vary between datasets.

A good approach is to choose a few models that broadly meet your requirements in terms of self-hosted versus cloud service, language, cost, and model size, and systematically test them on a subset of your own data.
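
One simple way to run such a test is to measure a retrieval metric like recall@1 for each candidate model on a handful of hand-labelled query/document pairs from your own knowledge base. Below is a minimal sketch; the model shortlist, queries, and documents are hypothetical placeholders.

```python
# Sketch: comparing candidate embedding models by recall@1 on your own labelled data.
# Assumes `pip install sentence-transformers numpy`; models and data below are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "We stock hats in sizes S, M, L, and XL.",
    "Refunds are available within 30 days of purchase.",
    "Standard shipping takes 3-5 business days.",
]
# Each test query is paired with the index of the document it should retrieve.
test_set = [
    ("what hat sizes can I buy", 0),
    ("how long do I have to return an item", 1),
    ("when will my order arrive", 2),
]

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:  # example shortlist
    model = SentenceTransformer(model_name)
    doc_embeddings = model.encode(documents, normalize_embeddings=True)
    hits = 0
    for query, expected_index in test_set:
        query_embedding = model.encode([query], normalize_embeddings=True)[0]
        if int(np.argmax(doc_embeddings @ query_embedding)) == expected_index:
            hits += 1
    print(f"{model_name}: recall@1 = {hits / len(test_set):.2f}")
```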

So how do I choose the right embedding model for my retrieval-augmented generation system?

There is unfortunately no single correct answer to how to choose the right embedding model for your retrieval-augmented generation system. And even if there were, the language embedding field is developing so fast that today's best choice might be far from the best choice in six months. You need to consider the factors above and weigh their importance against your requirements and constraints. You also need to experiment with the models against the content in your own knowledge base to find the best option for you.

If you are still experimenting with your use case and data sources, you may want to start out either with cloud-service models or a smaller open-source model that can be hosted on consumer hardware. Once you have concrete plans for your retrieval-augmented generation system, you may want to move to using your own hardware for cost or privacy reasons. Or you might do exactly the opposite, and stick with a cloud service for ease of use and easy upgrading when the models improve.

Do you need a turn-key solution to build enterprise conversational AI?

If you don’t want to go through the process of evaluating different embedding models, we offer conversational AI bots that work out of the box. You can simply add all of your internal data, and we will do the embedding for you, fully privately and on-premise, with our patent-pending retrieval-augmented generation solution, DocBrain, and our advanced knowledge-management capabilities.

Unlike other platforms, we use our own embedding models. This means we don’t send your full knowledge base to a third-party provider like OpenAI when converting your knowledge base into embeddings.

Book your demo today to find out how we can help you implement a generative AI solution in days.