
Retrieval Augmented Generation

Summary

  • Learn how retrieval augmented generation (RAG) works by combining large language models (LLMs) with real-time, external data for more accurate and relevant outputs.
  • See how RAG solves specific problems, such as reducing hallucinations and delivering domain-specific answers, all without costly retraining.
  • Explore real-world use cases for RAG and future trends in industries like customer support, compliance and enterprise search.

What Is Retrieval Augmented Generation, or RAG?

Retrieval augmented generation (RAG) is a hybrid AI framework that bolsters large language models (LLMs) by combining them with external, up-to-date data sources. Instead of relying solely on static training data, RAG retrieves relevant documents at query time and feeds them into the model as context. By incorporating new and context-aware data, AI can generate more accurate, current and domain-specific responses.

RAG is quickly becoming the go-to architecture for building enterprise-grade AI applications. According to recent surveys, over 60% of organizations are developing AI-powered retrieval tools to improve reliability, reduce hallucinations and personalize outputs using internal data.

As generative AI expands into business functions like customer service, internal knowledge management and compliance, RAG’s ability to bridge the gap between general AI and specific organizational knowledge makes it an essential foundation for trustworthy, real-world deployments.


How RAG works

RAG enhances a language model’s output by injecting context-aware, real-time information retrieved from an external data source. When a user submits a query, the system first engages the retrieval model, which uses a vector database to identify and “retrieve” semantically similar documents, database records or other sources of relevant information. The system then combines those results with the original input prompt and sends the augmented prompt to a generative AI model, which synthesizes the retrieved information into its response.

This allows the LLM to produce more accurate, context-aware answers grounded in enterprise-specific or up-to-date data, rather than simply relying upon the model it was trained on.

RAG pipelines typically involve four steps: document preparation and chunking, vector indexing, retrieval and prompt augmentation. This process flow helps developers update data sources without retraining the model and makes RAG a scalable and cost-effective solution for building LLM applications in domains like customer support, knowledge bases and internal search.
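
To make these steps concrete, here is a minimal, self-contained sketch of the indexing, retrieval and prompt-augmentation steps. It uses a toy bag-of-words embedding and cosine similarity in place of a production embedding model and vector database, and it stops short of the actual LLM call, which would vary by provider.

```python
import numpy as np

# Toy corpus standing in for chunked enterprise documents.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are Monday through Friday, 9am to 5pm Eastern.",
    "Enterprise customers receive a dedicated account manager.",
]

def embed(text, vocab):
    """Toy bag-of-words embedding; a real pipeline would use an embedding model."""
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in vocab], dtype=float)

# Build a tiny "vector index" over the document chunks.
vocab = sorted({w for doc in documents for w in doc.lower().split()})
index = np.stack([embed(doc, vocab) for doc in documents])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query, k=2):
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed(query, vocab)
    scores = [cosine(row, q) for row in index]
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return [documents[i] for i in top]

query = "What is the refund policy for returns?"
context = "\n".join(retrieve(query))

# Prompt augmentation: retrieved chunks are injected into the prompt
# that would be sent to the generative model (the LLM call is omitted here).
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)
```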

What challenges does the retrieval augmented generation approach solve?

Problem 1: LLMs do not know your data

LLMs are deep learning models trained on massive datasets to understand, summarize and generate novel content. Most LLMs are trained on a wide range of public data so one model can respond to many types of tasks or questions. Once trained, however, an LLM cannot access data beyond its training cutoff point. This makes LLMs static and may cause them to respond incorrectly, give out-of-date answers or hallucinate when asked questions about data they have not been trained on.

Problem 2: AI applications must leverage custom data to be effective

For LLMs to give relevant and specific responses, organizations need the model to understand their domain and answer from their own data rather than giving broad, generalized responses. For example, organizations build customer support bots with LLMs, and those solutions must give company-specific answers to customer questions. Others are building internal Q&A bots that should answer employees' questions on internal HR data. How do companies build such solutions without retraining those models?

Solution: Retrieval augmentation is now an industry standard

An easy and popular way to use your own data is to provide it as part of the prompt with which you query the LLM. This is called retrieval augmented generation (RAG), because you retrieve the relevant data and use it as augmented context for the LLM. Instead of relying solely on knowledge derived from the training data, a RAG workflow pulls relevant information at query time and connects static LLMs with real-time data retrieval.

With RAG architecture, organizations can deploy any LLM and augment it to return relevant results for their organization by giving it a small amount of their own data, without the cost and time of fine-tuning or pretraining the model.
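
As a concrete illustration of providing your data as part of the prompt, the minimal sketch below shows one common prompt-augmentation pattern; the retrieved_chunks content and the instruction wording are illustrative placeholders, not a prescribed format.

```python
# Hypothetical retrieved chunks; in practice these come from the retrieval step.
retrieved_chunks = [
    "Policy 4.2: Employees accrue 1.5 vacation days per month of service.",
    "Policy 4.3: Unused vacation days roll over for up to 12 months.",
]

question = "How many vacation days do I earn each month?"

# The augmented prompt grounds the LLM in retrieved, organization-specific context.
augmented_prompt = (
    "You are a helpful assistant. Answer the question using only the context below. "
    "If the context does not contain the answer, say you do not know.\n\n"
    "Context:\n" + "\n".join(f"- {chunk}" for chunk in retrieved_chunks) + "\n\n"
    f"Question: {question}\nAnswer:"
)

# The augmented prompt is then sent to whichever LLM endpoint you use.
print(augmented_prompt)
```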

What are the use cases for RAG?

There are many different use cases for RAG. The most common ones are:

  1. Question-and-answer chatbots: Building chatbots on LLMs allows them to automatically derive more accurate answers from company documents and knowledge bases. Chatbots are used to automate customer support and website lead follow-up to answer questions and resolve issues quickly.

    For instance, Experian, a multinational data broker and consumer credit reporting company, wanted to build a chatbot to serve internal and customer-facing needs. They quickly realized that their current chatbot technologies struggled to scale to meet demand. By building their GenAI chatbot — Latte — on the Databricks Data Intelligence Platform, Experian was able to improve prompt handling and model accuracy, which gave their teams greater flexibility to experiment with different prompts, refine outputs and adapt quickly to evolutions in GenAI technology.

  2. Search augmentation: Augmenting search engines with LLM-generated answers can better address informational queries and make it easier for users to find the information they need to do their jobs.
  3. Knowledge engine (ask questions on your data, such as HR or compliance documents): Company data can be used as context for LLMs, allowing employees to easily get answers to their questions, including HR questions about benefits and policies as well as security and compliance questions.

    One way this is being deployed is at Cycle & Carriage, a leading automotive group in Southeast Asia. They turned to Databricks Mosaic AI to develop a RAG chatbot that improves productivity and customer engagement by tapping into their proprietary knowledge bases, such as technical manuals, customer support transcripts and business process documents. This made it easier for employees to search for information via natural language queries that deliver contextual, real-time answers.

What are the benefits of RAG?

The RAG approach has a number of key benefits, including:

  1. Providing up-to-date and accurate responses: RAG ensures that the response of an LLM is not based solely on static, stale training data. Rather, the model uses up-to-date external data sources to provide responses.
  2. Reducing inaccurate responses, or hallucinations: By grounding the LLM's output in relevant, external knowledge, RAG helps mitigate the risk of responding with incorrect or fabricated information (also known as hallucinations). Outputs can include citations of original sources, allowing human verification.
  3. Providing domain-specific, relevant responses: Using RAG, the LLM will be able to provide contextually relevant responses tailored to an organization's proprietary or domain-specific data.
  4. Being efficient and cost-effective: Compared to other approaches to customizing LLMs with domain-specific data, RAG is simple and cost-effective. Organizations can deploy RAG without needing to customize the model. This is especially beneficial when models need to be updated frequently with new data.

When should I use RAG and when should I fine-tune the model?

RAG is the right place to start: it is easy to implement and may be entirely sufficient for some use cases. Fine-tuning is most appropriate when you want the LLM's behavior to change or when it needs to learn a different "language." The two approaches are not mutually exclusive. As a future step, you can fine-tune a model to better understand domain language and the desired output form — and also use RAG to improve the quality and relevance of the response.

When I want to customize my LLM with data, what are all the options and which method is the best (prompt engineering vs. RAG vs. fine-tune vs. pretrain)?

There are four architectural patterns to consider when customizing an LLM application with your organization's data. These techniques are outlined below and are not mutually exclusive. Rather, they can (and should) be combined to take advantage of the strengths of each.

Prompt engineering
  • Definition: Crafting specialized prompts to guide LLM behavior
  • Primary use case: Quick, on-the-fly model guidance
  • Data requirements: None
  • Advantages: Fast, cost-effective, no training required
  • Considerations: Less control than fine-tuning

Retrieval augmented generation (RAG)
  • Definition: Combining an LLM with external knowledge retrieval
  • Primary use case: Dynamic datasets and external knowledge
  • Data requirements: External knowledge base or database (e.g., vector database)
  • Advantages: Dynamically updated context, enhanced accuracy
  • Considerations: Increases prompt length and inference computation

Fine-tuning
  • Definition: Adapting a pretrained LLM to specific datasets or domains
  • Primary use case: Domain or task specialization
  • Data requirements: Thousands of domain-specific or instruction examples
  • Advantages: Granular control, high specialization
  • Considerations: Requires labeled data, computational cost

Pretraining
  • Definition: Training an LLM from scratch
  • Primary use case: Unique tasks or domain-specific corpora
  • Data requirements: Large datasets (billions to trillions of tokens)
  • Advantages: Maximum control, tailored for specific needs
  • Considerations: Extremely resource-intensive

Regardless of the technique selected, building a solution in a well-structured, modularized manner ensures organizations will be prepared to iterate and adapt. Learn more about this approach in The Big Book of MLOps.

Common challenges in RAG implementation

Implementing RAG at scale introduces several technical and operational challenges.

  1. Retrieval quality. Even the most powerful LLMs can generate poor answers if the retrieval step surfaces irrelevant or low-quality documents. It is therefore crucial to develop an effective retrieval pipeline, including careful selection of embedding models, similarity metrics and ranking strategies.
  2. Context window limitations. With so much source material available, the risk is injecting too much content into the model, leading to truncated sources or diluted responses. Chunking strategies should balance semantic coherence with token efficiency (see the sketch after this list).
  3. Data freshness. A key benefit of RAG is its ability to surface up-to-date information, but document indexes can quickly go stale without scheduled ingestion jobs or automated updates. Keeping your data fresh helps avoid hallucinated or outdated answers.
  4. Latency. Retrieval, ranking and generation over large datasets or external APIs each add latency, which can degrade the user experience.
  5. RAG evaluation. Because of RAG’s hybrid nature, traditional AI evaluation models fall short. Evaluating the accuracy of outputs requires a combination of human judgment, relevance scoring and groundedness checks to assess response quality.
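
To illustrate the chunking trade-off mentioned above, here is a minimal sketch of a paragraph-aware chunker with a rough token budget. The whitespace-based token count and the default budget are illustrative stand-ins for a real tokenizer and a limit derived from your embedding model and LLM context window.

```python
def chunk_paragraphs(text, max_tokens=400):
    """Group paragraphs into chunks without exceeding a rough token budget.

    Paragraph boundaries help preserve semantic coherence; the budget keeps each
    chunk small enough to embed and to fit alongside other context in the LLM
    prompt. Token counting here is approximated by whitespace splitting.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):
        length = len(paragraph.split())
        if current and current_len + length > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Example: a long document split into prompt-sized chunks.
paragraph = "Section 1. Overview of the data retention policy. " * 40
sample = "\n\n".join([paragraph] * 5)
for i, chunk in enumerate(chunk_paragraphs(sample, max_tokens=700)):
    print(f"chunk {i}: ~{len(chunk.split())} tokens")
```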
     

What is a reference architecture for RAG applications?

There are many ways to implement a retrieval augmented generation system, depending on specific needs and data nuances. Below is one commonly adopted workflow to provide a foundational understanding of the process.

  1. Prepare data: Document data is gathered alongside metadata and subjected to initial preprocessing — for example, PII handling (detection, filtering, redaction, substitution). To be used in RAG applications, documents need to be chunked into appropriate lengths based on the choice of embedding model and the downstream LLM application that uses these documents as context.
  2. Index relevant data: Produce document embeddings and hydrate a Vector Search index with this data.
  3. Retrieve relevant data: Retrieve the parts of your data that are relevant to a user's query. That text is then provided as part of the prompt sent to the LLM.
  4. Build LLM applications: Wrap the components of prompt augmentation and LLM querying into an endpoint. This endpoint can then be exposed to applications such as Q&A chatbots via a simple REST API (see the sketch below).
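
The final step can be sketched as a small web service. The example below uses Flask purely as an illustration; retrieve_context and call_llm are hypothetical placeholders for your retrieval logic and your model (or Model Serving) client.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def retrieve_context(question: str) -> str:
    """Hypothetical placeholder: query your vector index for relevant chunks."""
    return "Retrieved chunk(s) relevant to: " + question

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: call your hosted or third-party LLM endpoint."""
    return "LLM answer grounded in the provided context."

@app.route("/answer", methods=["POST"])
def answer():
    # Retrieve relevant data, augment the prompt, then generate a response.
    question = request.get_json(force=True)["question"]
    context = retrieve_context(question)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return jsonify({"answer": call_llm(prompt), "context": context})

if __name__ == "__main__":
    app.run(port=8000)
```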

Databricks also recommends some key architectural elements of a RAG architecture:

  • Vector database: Some (but not all) LLM applications use vector databases for fast similarity searches, most often to provide context or domain knowledge in LLM queries. To ensure that the deployed language model has access to up-to-date information, regular vector database updates can be scheduled as a job. Note that the logic to retrieve from the vector database and inject information into the LLM context can be packaged in the model artifact logged to MLflow using the MLflow LangChain or PyFunc model flavors (see the sketch after this list).
  • MLflow LLM Deployments or Model Serving: In LLM-based applications where a third-party LLM API is used, the MLflow LLM Deployments or Model Serving support for external models can be used as a standardized interface to route requests to vendors such as OpenAI and Anthropic. In addition to providing an enterprise-grade API gateway, these tools centralize API key management and provide the ability to enforce cost controls.
  • Model Serving: When RAG uses a third-party API, one key architectural change is that the LLM pipeline makes external API calls, from the Model Serving endpoint to internal or third-party LLM APIs. This adds complexity, potential latency and another layer of credential management. By contrast, when serving a fine-tuned model, the model and its environment are deployed directly.
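
As one way to package the retrieval-and-augmentation logic as a model artifact, the hedged sketch below wraps it in an MLflow PyFunc model; the retrieval and generation calls are hypothetical placeholders, and the exact logging options may differ by MLflow version and environment.

```python
import mlflow
import mlflow.pyfunc


class RAGModel(mlflow.pyfunc.PythonModel):
    """Wraps retrieval, prompt augmentation and generation behind one predict()."""

    def predict(self, context, model_input):
        answers = []
        for question in model_input["question"]:
            retrieved = self._retrieve(question)   # placeholder for a vector search call
            prompt = f"Context:\n{retrieved}\n\nQuestion: {question}"
            answers.append(self._generate(prompt)) # placeholder for an LLM call
        return answers

    def _retrieve(self, question):
        return "Relevant chunk(s) for: " + question  # hypothetical retrieval logic

    def _generate(self, prompt):
        return "Grounded answer."  # hypothetical generation logic


# Log the whole pipeline as a single artifact that a serving endpoint can load.
with mlflow.start_run():
    mlflow.pyfunc.log_model(artifact_path="rag_model", python_model=RAGModel())
```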

Resources

Databricks customers using RAG

JetBlue

JetBlue has deployed "BlueBot," a chatbot that uses open source generative AI models complemented by corporate data, powered by Databricks. This chatbot can be used by all teams at JetBlue to get access to data that is governed by role. For example, the finance team can see data from SAP and regulatory filings, but the operations team will only see maintenance information.


Chevron Phillips

Chevron Phillips Chemical uses Databricks to support their generative AI initiatives, including document process automation.

Thrivent Financial

Thrivent Financial is looking at generative AI to make search better, produce better summarized and more accessible insights, and improve the productivity of engineering.

Where can I find more information about retrieval augmented generation?

There are many resources available to find more information on RAG, including:

  • Blogs
  • E-books
  • Demos

Contact Databricks to schedule a demo and talk to someone about your LLM and retrieval augmented generation (RAG) projects

Future of RAG technology

RAG is rapidly evolving from a jerry-rigged workaround into a foundational component of enterprise AI architecture. As LLMs grow more capable, RAG’s role is shifting from simply filling gaps in knowledge to anchoring systems that are structured, modular and more intelligent.

One way RAG is developing is through hybrid architectures, where RAG is combined with tools, structured databases and function-calling agents. In these systems, RAG provides unstructured grounding while structured data or APIs handle more precise tasks. Such architectures give organizations more reliable end-to-end automation.

Another major development is retriever-generator co-training, in which the retriever and the generator are trained jointly to optimize each other’s answer quality. This may reduce the need for manual prompt engineering or fine-tuning, and can lead to adaptive learning, fewer hallucinations and better overall performance from both the retriever and the generator.

As LLM architectures mature, RAG will likely become more seamless and contextual. Moving past finite stores of memory and information, these new systems will be capable of handling real-time data flows, multi-document reasoning and persistent memory, making them knowledgeable and trustworthy assistants.

Frequently asked questions (FAQ)

What is retrieval augmented generation (RAG)?
RAG is an AI architecture that strengthens LLMs by retrieving relevant documents and injecting them into the prompt. This enables more accurate, current and domain-specific responses without taking time to retrain the model.

When should I use RAG instead of fine-tuning?
Use RAG when you want to incorporate dynamic data without the cost or complexity of fine-tuning. It is ideal for use cases where accurate and timely information is required.

Does RAG reduce hallucinations in LLMs?
Yes. By grounding the model’s response in retrieved, up-to-date content, RAG reduces the likelihood of hallucinations. This is especially the case in domains that require high accuracy, like healthcare, legal work or enterprise support.

What kind of data does RAG need?
RAG uses unstructured text data — think sources like PDFs, emails and internal documents — stored in a retrievable format. These are typically stored in a vector database, and the data must be indexed and regularly updated to maintain relevance.

How do you evaluate a RAG system?
RAG systems are evaluated using a combination of relevance scoring, groundedness checks, human evaluations and task-specific performance metrics. But as we’ve seen, the possibilities for retriever-generator co-training may make regular evaluation easier as the models learn from — and train — one another.
