Faster, secure OSS LLM inference with prompt caching.
by Pei-Lun Liao, Asfandyar Qureshi, Roshan Regula, Bruce Fontaine, James Thomas and Chenyang Yu
Large language model (LLM) inference often involves repeated prompts—think of the same system or instruction prompt appearing in thousands of requests. Reprocessing that identical prefix for every call wastes compute cycles, inflates latency, and increases costs.
Prompt caching eliminates this redundancy, providing:
Prompt caching can be a powerful technique to raise a model’s quality in specific domains without compromising the model’s token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries. Frontier models, such as Claude, use system prompts that are many thousands of tokens long under the hood. Furthermore, in our recently published research we showed that automated prompt optimization allows open-source models to surpass frontier-model quality for enterprise tasks.
Databricks already provides built-in prompt caching for proprietary models (GPT, Gemini, Claude). We’ve now extended this capability to the open-weights models powering our Foundation Model APIs (FMAPIs) for batch inference, pay-per-token, and provisioned-throughput workloads. It also applies to any and all higher-level services powered by a foundation model, e.g., Agent Bricks, Genie, AI Functions.
Prompt caching is now supported for the following OSS models hosted on Databricks:
We will continue to roll out this feature across our other models. Security is a first‑class concern at Databricks. Prompt caches are isolated, only reside in volatile memory and are never persisted. Importantly, the caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput.
We rolled out prompt caching to our GPT‑OSS models first and immediately saw measurable gains in one of the large-scale production batch‑inference pipelines:

By automatically reusing KV caches for identical prompts, Databricks enables you to run open-source LLMs faster, more cost-effectively, and with greater security—all without requiring any additional configuration. Whether you’re serving real‑time chat, batch‑processing large document collections, or building AI agents, prompt caching can turn a good inference pipeline into a great one. Give it a try on your next OSS‑model deployment and watch the performance metrics climb.
Subscribe to our blog and get the latest posts delivered to your inbox.