Skip to main content
Databricks AI

Accelerating LLM Inference with Prompt Caching for Open‑Source Models on Databricks

Faster, secure OSS LLM inference with prompt caching.

by Pei-Lun Liao, Asfandyar Qureshi, Roshan Regula, Bruce Fontaine, James Thomas and Chenyang Yu

  • Prompt caching reuses repeated prompt prefixes so LLMs run faster. It cuts latency and boosts throughput automatically.
  • Databricks now supports prompt caching for open-source models across batch, pay-per-token, and provisioned workloads. No setup is required.
  • In production on GPT-OSS, prompt caching increased throughput by 2.5x and reduced P50 latency by 3x.

Why Prompt Caching Matters

Large language model (LLM) inference often involves repeated prompts—think of the same system or instruction prompt appearing in thousands of requests. Reprocessing that identical prefix for every call wastes compute cycles, inflates latency, and increases costs.

Prompt caching eliminates this redundancy, providing:

  • Lower latency – the prefill stage can be skipped when the cache is hit.
  • Higher throughput – more tokens are processed per model unit.

Prompt caching can be a powerful technique to raise a model’s quality in specific domains without compromising the model’s token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries. Frontier models, such as Claude, use system prompts that are many thousands of tokens long under the hood. Furthermore, in our recently published research we showed that automated prompt optimization allows open-source models to surpass frontier-model quality for enterprise tasks.

Feature availability

Databricks already provides built-in prompt caching for proprietary models (GPT, Gemini, Claude). We’ve now extended this capability to the open-weights models powering our Foundation Model APIs (FMAPIs) for batch inference, pay-per-token, and provisioned-throughput workloads. It also applies to any and all higher-level services powered by a foundation model, e.g., Agent Bricks, Genie, AI Functions.

Prompt caching is now supported for the following OSS models hosted on Databricks:

  • GPT‑OSS 20B and 120B
  • Gemma 3 12B
  • Fine-tuned Llama 3.1 8B (via PEFT serving)
  • Llama 3.1 8B and 3.3 70B

We will continue to roll out this feature across our other models. Security is a first‑class concern at Databricks. Prompt caches are isolated, only reside in volatile memory and are never persisted. Importantly, the caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput.

Real‑World Impact: batch inference on GPT OSS

We rolled out prompt caching to our GPT‑OSS models first and immediately saw measurable gains in one of the large-scale production batch‑inference pipelines:

  • Per‑replica input‑token throughput increased by 2.5x
  • P50 latency reduced by 3x
  • All this with a relatively low cache hit ratio of 30%
Prompt Caching GPT‑OSS Models

Takeaway

By automatically reusing KV caches for identical prompts, Databricks enables you to run open-source LLMs faster, more cost-effectively, and with greater security—all without requiring any additional configuration. Whether you’re serving real‑time chat, batch‑processing large document collections, or building AI agents, prompt caching can turn a good inference pipeline into a great one. Give it a try on your next OSS‑model deployment and watch the performance metrics climb.

Get the latest posts in your inbox

Subscribe to our blog and get the latest posts delivered to your inbox.