Understanding Retrieval-Augmented Generation: Part 1

January 25, 2024 • Written By Sherlock Xu

Imagine you are a contestant on a competitive cooking show (like Hell’s Kitchen), required to create a dish that’s not only delicious but also tells a unique story. You already have some cooking skills thanks to your past training, but what if you could freely access a global library of recipes, regional cooking techniques, and even flavor combinations? That’s where your sous-chef, equipped with a vast culinary database, steps in. This sous-chef doesn’t just bring you ingredients; she also brings specialized knowledge and inspiration, helping you transform your cooking into a masterpiece that tells a unique, flavorful story.

This is the essence of Retrieval-Augmented Generation, or RAG in the AI world. Like the sous-chef who elevates your cooking with a wealth of custom resources, RAG enhances the capabilities of large language models (LLMs). It’s not just about responding to queries based on pre-existing knowledge; RAG allows the model to dynamically access and incorporate a vast range of external information, just like tapping into a global culinary database for that unique recipe.

As a partner of the LlamaIndex RAG Hackathon, we will release a two-article blog series about RAG to help the BentoML community gain a better understanding of its concepts and usage. In this first post, we will explore the mechanics of this technology, its benefits, as well as the challenges it faces, offering a comprehensive taste of how RAG is redefining the boundaries of AI interactions.

RAG 101

Patrick Lewis and his colleagues at Meta first proposed the concept of RAG in the 2020 paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. At its core, RAG has two important components: the retrieval system and the language model.

  • The retrieval system: This is like the data corpus of RAG. The retrieval system scans through extensive databases of information to find the most relevant and useful data that can enhance the response to a query. This process is similar to selecting the perfect ingredients for a recipe, ensuring that each one contributes to the final flavor profile.
  • The language model: This is the chef who knows how to combine ingredients into a dish. The language model takes the information sourced by the retrieval system and integrates it into contextually relevant responses.

Traditional language models are like chefs working with a fixed set of ingredients. They can create impressive dishes (responses) based on what they have (their training data), but they are limited to those ingredients. RAG, on the other hand, has the ability to constantly source new ingredients (information), making dishes (responses) far more diverse, accurate, and rich.
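To make the interplay between the two components concrete, here is a deliberately simplified sketch in Python. The keyword-overlap retriever and the prompt template are toy stand-ins invented for illustration: a real retriever compares embeddings (covered in the next section), and the assembled prompt would be sent to an LLM rather than printed.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augment the user's question with retrieved context for the language model."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using this context:\n{context_block}\n\nQuestion: {query}"

corpus = [
    "RAG combines a retrieval system with a language model.",
    "Vector databases store embeddings for fast similarity search.",
    "Sourdough bread needs a long fermentation.",
]
query = "What does a vector database store?"
context = retrieve(query, corpus)
prompt = build_prompt(query, context)
print(prompt)
```

The key design point survives the simplification: retrieval happens first, and the language model only ever sees the query plus the handful of passages the retriever selected.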

Embeddings and vector databases in RAG

In the world of RAG, answering a user’s question involves a computational process in which embeddings and vector databases play important roles.


Embeddings

The first step in RAG’s retrieval process involves translating the user’s query into a format that the AI model can understand. This is often done through embeddings or vectors. An embedding is essentially a numeric representation of the query, capturing not just the words, but their context and semantic meaning. Think of it as translating a recipe request into a list of necessary flavor profiles and cooking skills.

Note: Previously, we published two blog posts on creating sentence embedding and image embedding applications with BentoML respectively. Read them for more details.

Embeddings allow the AI model to process and compare the query against a vast array of stored data efficiently. This process is similar to a chef understanding the essence of a dish and then knowing exactly what ingredients and techniques to use.
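A minimal sketch of how embeddings are compared, using cosine similarity. The three-dimensional vectors here are made up purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the values come from a trained model, not by hand.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only.
embeddings = {
    "recipe":  [0.9, 0.1, 0.0],
    "cooking": [0.8, 0.2, 0.1],
    "finance": [0.0, 0.1, 0.9],
}

related = cosine_similarity(embeddings["recipe"], embeddings["cooking"])   # high
unrelated = cosine_similarity(embeddings["recipe"], embeddings["finance"]) # low
print(related, unrelated)
```

Because similarity is measured geometrically, "recipe" and "cooking" score close together even though they share no characters, which is exactly what keyword matching cannot do.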

Vector databases

After you have the embeddings, the next crucial component is the vector database.

Vector databases in RAG store a massive amount of pre-processed information, each piece also represented as embeddings. When the AI model receives a query, it uses these embeddings to search through the database, looking for matches or closely related information.

The use of vector databases allows RAG to search through and retrieve relevant information with decent speed and precision. It’s like having an instant global connection to different flavors and ingredients, each cataloged not just by name, but by their taste profiles and culinary uses.
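A vector database can be sketched as a collection of (document, embedding) pairs searched by similarity. The snippet below is a toy, brute-force version with made-up vectors; production vector databases use approximate nearest-neighbor indexes to keep this search fast over millions of entries.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# An in-memory "vector database": each document stored with a precomputed embedding.
# The vectors are invented for illustration; a real store holds model-generated embeddings.
database = [
    ("Braising techniques for tough cuts", [0.9, 0.2, 0.1]),
    ("Quarterly earnings report guide",    [0.1, 0.9, 0.2]),
    ("Regional spice pairings",            [0.8, 0.1, 0.3]),
]

def search(query_vec: list[float], top_k: int = 2) -> list[str]:
    """Linear scan: return the top_k documents most similar to the query embedding."""
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

results = search([0.85, 0.15, 0.2])  # a "cooking-flavored" query vector
print(results)
```

Note that the documents themselves were embedded once, ahead of time; at query time only the query needs to be embedded before the similarity search runs.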

Ultimately, the embeddings, the vector database, and the language model work together to make sure the final response is a well-thought-out answer that blends the retrieved information with the AI’s pre-trained data.

The benefits of RAG

RAG comes with a number of benefits. To name a few:

  • Enhanced accuracy. By leveraging up-to-date external information, RAG ensures that the answers are not only contextually relevant but also enriched with the latest data. This is particularly important in fields like medicine, technology, and finance.
  • Dynamically updated information. Unlike traditional models that rely solely on their training data, RAG models can access and incorporate dynamically updated information. Supplying this information does not modify the underlying language model itself, so it incurs no additional training costs.
  • Source attribution. Since the retrieval system knows which documents or text snippets it has pulled from the database, it can provide this information along with the generated response. This adds an extra layer of transparency and trust to the generated responses.
  • Personalized interactions. RAG has the potential for more personalized AI interactions. As the system understands and incorporates specific details from users’ queries, it can provide responses that are more aligned with individual needs and preferences.

The implications of RAG's benefits extend far beyond just improved answers. They represent an important shift in how we interact with AI, transforming it into a tool capable of providing informed, accurate, and contextually rich interactions. This opens up new possibilities in education, customer service, research, and any other fields where access to updated, relevant information is important.

Challenges and limitations

Key challenges of RAG include:

  • Data retrieval complexity. One of the primary challenges in RAG is ensuring the accuracy and relevance of the data retrieved. While RAG is able to pull in vast amounts of information, filtering this data to find the most pertinent pieces can be complex. Ensuring the retrieval system can understand the nuances of different queries is important but not always easy.
  • Balancing relevance with reliability. The retrieval system may have access to a wide range of data, but not all sources are equally trustworthy. Therefore, it is important to balance the relevance of the latest information with the reliability and credibility of sources. This may require developing mechanisms to evaluate and prioritize reliable sources.
  • Computational resources and costs. RAG systems, particularly those handling large datasets and complex queries, require substantial computational resources. The process of retrieving, processing, and integrating external information in real time can be computationally intensive, leading to higher operational costs and potential efficiency concerns.
  • Future-proofing. Ensuring that RAG systems remain effective and up-to-date over time is another challenge. As information sources and user expectations evolve, RAG systems must also adapt and scale accordingly, which requires ongoing development and maintenance.


Despite these challenges, RAG holds great potential to transform AI interactions. Its role in enhancing AI’s capabilities is undeniable, and the journey to refine this technology further is both challenging and exciting.

In the next article, we will explore the real-world applications of RAG and its future outlook.