Engineering blog

Using MLflow AI Gateway and Llama 2 to Build Generative AI Apps

Achieve greater accuracy using Retrieval Augmented Generation (RAG) with your own data

To build customer support bots, internal knowledge graphs, or Q&A systems, customers often use Retrieval Augmented Generation (RAG) applications, which leverage pre-trained models together with their proprietary data. However, the lack of guardrails for secure credential management and abuse prevention prevents customers from democratizing access to and development of these applications. We recently announced the MLflow AI Gateway, a highly scalable, enterprise-grade API gateway that enables organizations to manage their LLMs and make them available for experimentation and production. Today we are excited to announce an extension of the AI Gateway to better support RAG applications. Organizations can now centralize the governance of privately hosted model APIs (via Databricks Model Serving), proprietary APIs (OpenAI, Co:here, Anthropic), and now open model APIs via MosaicML to develop and deploy RAG applications with confidence.

In this blog post, we'll walk through how you can build and deploy a RAG application on the Databricks Lakehouse AI platform, using the Llama2-70B-Chat model for text generation and the Instructor-XL model for text embeddings, both hosted and optimized through MosaicML's Starter Tier Inference APIs. Using hosted models allows us to get started quickly and provides a cost-effective way to experiment at low throughput.

The RAG application we're building in this blog answers gardening questions and gives plant care recommendations. 

If you're already familiar with RAG and just want to see an example of how it all comes together, check out our demo (with example notebooks) that shows you how to use Llama2, MosaicML inference, and vector search to build your RAG application on Databricks.

What is RAG?

RAG is a popular architecture that allows customers to improve model performance by leveraging their own data. This is done by retrieving relevant data/documents and providing them as context for the LLM. RAG has shown success in chatbots and Q&A systems that need to maintain up-to-date information or access domain-specific knowledge.
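The loop is easy to sketch in plain Python. The retriever below is a toy stand-in (word overlap instead of vector similarity), and the documents and prompt format are invented purely for illustration:

```python
# Toy retrieve-then-generate flow: score documents by word overlap with the
# question, then inject the best match into the prompt sent to the LLM.
documents = [
    "Fiddle leaf figs drop leaves when overwatered or moved to lower light.",
    "Tomatoes need at least six hours of direct sun per day.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Ground the model's answer in the retrieved context."""
    return (
        f"Answer using only this context:\n{' '.join(context)}\n\n"
        f"Question: {question}"
    )

question = "Why is my fiddle leaf fig dropping leaves?"
prompt = build_prompt(question, retrieve(question, documents))
# `prompt` would then be sent to the completion model.
```

A real RAG application replaces the overlap scorer with embedding similarity against a vector index, which is exactly what we build below.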

Use the AI Gateway to put guardrails in place for calling model APIs

The recently announced MLflow AI Gateway allows organizations to centralize governance, credential management, and rate limits for their model APIs, including SaaS LLMs, via an object called a Route. Distributing Routes allows organizations to democratize access to LLMs while also ensuring user behavior doesn't abuse or take down the system. The AI Gateway also provides a standard interface for querying LLMs to make it easy to upgrade models behind routes as new state-of-the-art models get released. 
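One consequence of the standard interface is that client code depends only on a Route name, so the model behind a Route can be upgraded without touching callers. A stdlib-only sketch of the idea (the `query` function and `routes` table here are stand-ins for illustration, not the real gateway API):

```python
# Stand-in gateway: a table mapping Route names to backing models.
# Credential management and rate limiting would live behind this boundary.
routes = {"completion": "llama2-70b-chat"}

def query(route: str, data: dict) -> dict:
    """Send a completions-style payload to whatever model backs `route`."""
    model = routes[route]
    return {"model": model, "prompt": data["prompt"]}

resp_before = query("completion", {"prompt": "hello"})

# Upgrade the model behind the Route; the calling code is unchanged.
routes["completion"] = "mpt-30b-chat"
resp_after = query("completion", {"prompt": "hello"})
```

The caller issues the same `query("completion", ...)` before and after the swap, which is the property that makes Routes safe to hand out per use case.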

We typically see organizations create a Route per use case and many Routes may point to the same model API endpoint to make sure it is getting fully utilized. 

For this RAG application, we want to create two AI Gateway Routes: one for our embedding model and another for our text generation model. We are using open models for both because we want a supported path for fine-tuning or privately hosting in the future, avoiding vendor lock-in. To do this, we will use MosaicML's Inference APIs, which provide fast and easy access to state-of-the-art open source models for rapid experimentation, with token-based pricing. MosaicML supports MPT and Llama2 models for text completion and Instructor models for text embeddings. In this example, we will use Llama2-70B-Chat, which was trained on 2 trillion tokens and fine-tuned for dialogue, safety, and helpfulness by Meta, and Instructor-XL, a 1.2B-parameter instruction fine-tuned embedding model from HKUNLP.

It's easy to create a route for Llama2-70B-Chat using the new support for MosaicML Inference APIs on the AI Gateway:

from mlflow.gateway import create_route

mosaicml_api_key = "your key"

# Route for text completions, backed by Llama2-70B-Chat on MosaicML
create_route(
    "completion",          # route name
    "llm/v1/completions",  # route type
    {
        "name": "llama2-70b-chat",
        "provider": "mosaicml",
        "mosaicml_config": {
            "mosaicml_api_key": mosaicml_api_key,
        },
    },
)

Similar to the text completion Route configured above, we can create another Route for Instructor-XL, available through the MosaicML Inference API:

# Route for text embeddings, backed by Instructor-XL on MosaicML
create_route(
    "embeddings",
    "llm/v1/embeddings",
    {
        "name": "instructor-xl",
        "provider": "mosaicml",
        "mosaicml_config": {
            "mosaicml_api_key": mosaicml_api_key,
        },
    },
)

To get an API key for MosaicML hosted models, sign up here.

Use LangChain to piece together retriever and text generation

Now we need to build our vector index from our document embeddings so that we can do document similarity lookups in real-time. We can use LangChain and point it to our AI Gateway Route for our embedding model:

# Create the vector index
from langchain.embeddings import MlflowAIGatewayEmbeddings
from langchain.vectorstores import Chroma

# Point LangChain at the AI Gateway Route for embeddings
mosaicml_embedding_route = MlflowAIGatewayEmbeddings(
    gateway_uri="databricks",
    route="embeddings",  # must match the Route name created earlier
)

# Load the chunked documents (`docs`) into Chroma
db = Chroma.from_documents(
    docs,
    embedding=mosaicml_embedding_route,
    persist_directory="/tmp/gardening_db",
)
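Note that `docs` above is assumed to already hold chunked documents; in practice you would use a LangChain text splitter to produce it. A hand-rolled sketch of the same idea, using overlapping fixed-size chunks so that sentences straddling a boundary land in both neighbors:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`
    characters, so context spanning a chunk boundary is not lost."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 500-character stand-in document
text = "".join(chr(97 + i % 26) for i in range(500))
chunks = chunk_text(text)
```

Chunk size is one of the more impactful knobs in a RAG pipeline: chunks must be small enough to fit several into the prompt's context slot, but large enough that each is meaningful on its own.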

 

We then need to stitch together our prompt template and text generation model:

from langchain.llms import MlflowAIGateway
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Create a prompt structure for Llama2 Chat (note that if using MPT the prompt structure would differ)
template = """[INST] <<SYS>>
You are an AI assistant, helping gardeners by providing expert gardening answers and advice.
Use only information provided in the following paragraphs to answer the question at the end.
Explain your answer with reference to these paragraphs.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(input_variables=["context", "question"], template=template)

# Retrieve the AI Gateway Route for text completion
mosaic_completion_route = MlflowAIGateway(
    gateway_uri="databricks",
    route="completion",
    params={"temperature": 0.1},
)

# Wrap the prompt and Gateway Route into a chain
retrieval_qa_chain = RetrievalQA.from_chain_type(
    llm=mosaic_completion_route,
    chain_type="stuff",
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)

The RetrievalQA chain ties the two components together so that the documents retrieved from the vector database seed the context for the text generation model:

query = "Why is my Fiddle Fig tree dropping its leaves?"

retrieval_qa_chain.run(query)
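With `chain_type="stuff"`, LangChain simply "stuffs" every retrieved document into the single `{context}` slot of the prompt before calling the completion Route. A stdlib-only sketch of that assembly (a hypothetical helper for illustration, not LangChain's internals):

```python
# Illustrative version of what the "stuff" chain type does: concatenate all
# retrieved documents into the {context} slot, then fill in the question.
TEMPLATE = """[INST] <<SYS>>
You are an AI assistant for gardeners.
<</SYS>>

{context}

{question} [/INST]"""

def stuff_prompt(template: str, docs: list[str], question: str) -> str:
    """Join retrieved docs and render the final prompt string."""
    return template.format(context="\n\n".join(docs), question=question)

final_prompt = stuff_prompt(
    TEMPLATE,
    ["Fiddle figs drop leaves under water stress.",
     "They prefer bright, indirect light."],
    "Why is my Fiddle Fig tree dropping its leaves?",
)
```

Because every retrieved chunk lands in the prompt verbatim, "stuff" works well when the retrieved set is small; larger document sets would need a map-reduce or refine style chain to stay within the model's context window.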

You can now log the chain using the MLflow LangChain flavor and deploy it on a Databricks CPU Model Serving endpoint. Using MLflow automatically provides model versioning, adding more rigor to your production process.

After completing a proof of concept, experiment to improve quality

Depending on your requirements, there are many experiments you can run to find the right optimizations to take your application to production. Using the MLflow tracking and evaluation APIs, you can log every parameter, base model, performance metric, and model output for comparison. The new Evaluation UI in MLflow makes it easy to compare model outputs side by side, and all MLflow tracking and evaluation data is stored in queryable formats for further analysis. Some experiments we commonly see:

  1. Latency - Try smaller models to reduce latency and cost
  2. Quality - Try fine-tuning an open source model with your own data. This can help with domain-specific knowledge and adhering to a desired response format.
  3. Privacy - Try privately hosting the model on Databricks LLM-Optimized GPU Model Serving and using the AI Gateway to fully utilize the endpoint across use cases

Get started developing RAG applications today on Lakehouse AI with MosaicML

The Databricks Lakehouse AI platform enables developers to rapidly build and deploy generative AI applications with confidence.

To replicate the above chat application in your organization, install our complete RAG chatbot demo directly in your workspace to see how to use Llama2, MosaicML Inference, and Vector Search on the Lakehouse.

Explore the LLM RAG chatbot demo

 
