Customers expect instant responses across every interaction, whether it is a recommendation rendered in milliseconds, a fraudulent charge blocked before it clears, or a search result that feels immediate to the user. At scale, delivering these experiences depends on model serving systems that remain fast, stable, and predictable even under sustained and uneven load.
As traffic grows into the tens or hundreds of thousands of requests per second, many teams run into the same set of challenges. Latency becomes inconsistent, infrastructure costs increase, and systems require constant tuning to handle spikes and drops in demand. Failures also become harder to diagnose as more components are stitched together, pulling teams away from improving models and toward keeping production systems running.
This post explains how Model Serving on Databricks supports high-QPS, real-time workloads and outlines concrete best practices you can apply to achieve low latency, high throughput, and predictable performance in production.
Databricks Model Serving provides fully managed, scalable serving infrastructure directly within your Databricks lakehouse. Simply take an existing model from your model registry, deploy it, and get back a REST endpoint on managed infrastructure that is highly scalable and optimized for high-QPS traffic.
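As a rough illustration, here is a minimal sketch of deploying a registered model as a serving endpoint with the Databricks Python SDK. The endpoint name, the catalog/schema/model names, and the sizing values are placeholders, and the exact class and field names may differ across SDK versions.

```python
# Minimal sketch, assuming the Databricks Python SDK (databricks-sdk) and a model
# registered in Unity Catalog. Endpoint, catalog, schema, and model names are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()  # picks up workspace credentials from the environment

w.serving_endpoints.create(
    name="my-high-qps-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.ml_models.recommender",  # placeholder registered model
                entity_version="1",
                workload_size="Small",        # revisit sizing once you know your traffic
                scale_to_zero_enabled=False,  # keep warm capacity for low-latency serving
            )
        ]
    ),
)
```

Once the endpoint is ready, it exposes a REST invocations URL that any HTTP client can call.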
Databricks Model Serving is optimized for mission-critical, high-QPS workloads:
Databricks Model Serving empowers our team to deploy machine learning models with the reliability and scale required for real-time applications. It is designed to handle high-QPS workloads while maximizing hardware utilization. On top of that, Databricks provides a SOTA Feature Store solution with super fast lookups needed for such workloads. With these capabilities, our ML engineers can focus on what matters: refining model performance and enhancing user experience. — Bojan Babic, Research Engineer, You.com
With this foundation in place, the next step is optimizing your endpoints, models, and client applications to consistently achieve high throughput and low latency, especially as traffic grows. The best practices that follow are used in real customer deployments running millions to billions of inferences every day.
Please see our best practices guide for more details.
A key first step is to ensure the network layer is optimized for high throughput and low latency. Model Serving does this for you through route-optimized endpoints. When you enable route optimization on an endpoint, Databricks Model Serving optimizes the network and routing for inference requests, resulting in faster, more direct communication between your client and the model. This significantly reduces the time a request takes to reach the model, and is especially useful for low-latency applications like recommendation systems, search, and fraud detection.
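As I understand it, route optimization is requested when the endpoint is created rather than toggled on afterwards. Below is a hedged sketch of doing this through the endpoint-creation REST API; the `route_optimized` field name and its placement at the top level of the request are my assumptions and worth verifying against the current Model Serving docs.

```python
# Hedged sketch: enabling route optimization through the endpoint-creation REST API.
# The "route_optimized" field name and placement are assumptions; verify against the docs.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://my-workspace.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "my-high-qps-endpoint",
        "route_optimized": True,  # assumed flag; route optimization is set at creation time
        "config": {
            "served_entities": [
                {
                    "entity_name": "main.ml_models.recommender",  # placeholder model
                    "entity_version": "1",
                    "workload_size": "Small",
                    "scale_to_zero_enabled": False,
                }
            ]
        },
    },
    timeout=30,
)
resp.raise_for_status()
```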
In high-throughput scenarios, reducing model complexity, offloading processing from the serving endpoint, and picking the right concurrency targets help your endpoint scale to large request volumes with just the right amount of compute. This keeps your endpoints cost-efficient while still letting them scale to hit performance targets.
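A simple way to pick a concurrency target is Little's Law: the concurrency an endpoint needs is roughly the target QPS multiplied by per-request latency in seconds, plus some headroom. The numbers in the sketch below are hypothetical; substitute your own measured latency and traffic goals.

```python
# Back-of-envelope sizing with Little's Law: concurrency ≈ throughput × latency.
# All numbers below are hypothetical; plug in your own measured latency and traffic targets.
import math

target_qps = 5_000       # peak requests per second to sustain
p50_latency_s = 0.015    # typical per-request model latency, in seconds
headroom = 1.5           # buffer for latency variance and traffic spikes

required_concurrency = math.ceil(target_qps * p50_latency_s * headroom)
print(f"Provision roughly {required_concurrency} concurrent request slots")
# ≈ 113 here; map this onto the endpoint's workload size or provisioned concurrency
# settings rather than over-provisioning "just in case".
```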
With Databricks Model Serving, we can handle high-QPS workloads like personalization and recommendations in real time. It gives our brands the scale and speed needed to deliver bespoke content experiences to our millions of readers. — Oscar Celma, SVP of Data Science and Product Analytics at Conde Nast
Optimizing client-side code ensures requests are processed quickly and your endpoint's compute instances stay fully utilized, which leads to higher throughput (QPS), lower latency, and lower cost.
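For example, a client that reuses HTTP connections and keeps several requests in flight avoids leaving the endpoint's provisioned concurrency idle between calls. The sketch below is a minimal illustration; the endpoint URL, token, payload shape, and pool size are all placeholders to adapt to your workload.

```python
# Sketch: reuse connections and keep several requests in flight so provisioned
# endpoint capacity is not left idle. URL, token, payload, and pool size are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT_URL = os.environ["SERVING_ENDPOINT_URL"]  # .../serving-endpoints/<name>/invocations
TOKEN = os.environ["DATABRICKS_TOKEN"]

session = requests.Session()  # reuses TCP/TLS connections across requests
session.headers.update({"Authorization": f"Bearer {TOKEN}"})

def score(payload: dict) -> dict:
    resp = session.post(ENDPOINT_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()

payloads = [{"dataframe_records": [{"user_id": i}]} for i in range(100)]  # illustrative inputs

# A modest worker pool keeps requests in flight without overwhelming the client.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(score, payloads))
```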
Batch requests together when calling Databricks Model Serving Endpoints
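Where the application can tolerate it, packing multiple records into a single request amortizes network and serialization overhead and raises effective throughput. A minimal sketch, assuming the standard MLflow `dataframe_records` scoring format and placeholder feature names:

```python
# Sketch: score many rows per call instead of one row per call, using the standard
# MLflow "dataframe_records" input format. Feature names and batch size are placeholders.
import os
import requests

ENDPOINT_URL = os.environ["SERVING_ENDPOINT_URL"]
TOKEN = os.environ["DATABRICKS_TOKEN"]

rows = [{"user_id": i, "item_id": 42} for i in range(64)]  # 64 rows in a single request

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dataframe_records": rows},
    timeout=10,
)
resp.raise_for_status()
predictions = resp.json()  # one prediction per input row
```

Batch size is a latency/throughput trade-off: larger batches amortize more overhead but increase per-call latency, so keep them within your latency budget.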