Building a Production Scale, Totally Private, OSS RAG Pipeline with DBRX, Spark, and LanceDB



TRACK: Generative AI
TECHNOLOGIES: AI/Machine Learning, Apache Spark, GenAI/LLMs
SKILL LEVEL: Intermediate

Enterprises that want to bring AI to production face obstacles around data security. Data often has to be shipped off to a hosted LLM and a hosted embedding model, and the generated vectors are typically stored in a hosted vector database. In addition, these hosted vector databases rarely support bulk ingestion, making it difficult to load production-scale data quickly.

The recent release of DBRX represents a significant breakthrough in the quality of open source models and gives enterprises a viable option for high-quality generative AI responses from a self-hosted model. For the memory layer, LanceDB is OSS and supports real-time serving of billion-scale embedding datasets with far fewer resources than alternatives. Under the hood, LanceDB stores data in the Lance columnar format, and large-scale updates can be written in minutes via Lance's Spark DataSource. The same dataset can be used both for offline analytics, EDA, and training, and for online serving in LanceDB for AI retrieval in service of RAG, agents, and more. LanceDB's embedding function registry can be extended to target custom embedding models served from MLflow, so data never has to be sent off premises.


By combining Spark, DBRX, and LanceDB, you can create your own completely private generative AI pipeline without having to leave your comfortable lakehouse.
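Conceptually, the retrieval step that LanceDB performs for RAG is a nearest-neighbor search over stored embedding vectors. The toy sketch below shows that step with brute-force cosine similarity over a handful of vectors; at billion scale LanceDB uses approximate indexes rather than this exact scan, and the documents and vectors here are made up for illustration.

```python
# Toy sketch of the RAG retrieval step: rank stored documents by cosine
# similarity to a query vector and return the top-k matches.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# (text, embedding) pairs standing in for a vector table
docs = [
    ("spark tuning guide", [0.9, 0.1, 0.0]),
    ("lance format spec",  [0.1, 0.9, 0.0]),
    ("dbrx model card",    [0.0, 0.1, 0.9]),
]

def top_k(query_vec, k=2):
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# A query vector near the "dbrx model card" embedding retrieves that doc
print(top_k([0.0, 0.2, 0.8], k=1))  # → ['dbrx model card']
```

The retrieved texts would then be stuffed into the DBRX prompt as context, completing the private RAG loop.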


Chang She

CEO / Co-founder

Jasmine Wang

Head of Ecosystem Engagement