Databricks infrastructure engineers are heading to SRECon 2026 in Seattle on March 24th. We are excited to share some of the work we have been doing to scale, operate, and evolve the infrastructure behind the Databricks Platform.
Join us to chat with engineers from our infrastructure teams, including Bricksters working on service mesh, traffic routing, configuration management, and running stateful services. This is a great opportunity to explore the biggest problems that engineers are solving and the infrastructure innovations they’re driving.
Plus, don’t miss these technical sessions!
Databricks runs thousands of microservices across AWS, Azure, and GCP. At this scale, Kubernetes' default load balancing breaks down. The built-in kube-proxy and ClusterIP model operates at Layer 4, distributing connections rather than requests. For gRPC services with long-lived HTTP/2 connections, this leads to severe traffic skew: some pods get overwhelmed while others sit idle. The result is tail latency spikes, wasted compute, and unpredictable service behavior.
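To illustrate why connection-level balancing skews under long-lived HTTP/2 connections, here is a toy simulation (illustrative only, not Databricks code): each client pins one connection to a single pod, so pod load follows connection placement and per-client request volume rather than being spread per request.

```python
import random
from collections import Counter

random.seed(7)
PODS = [f"pod-{i}" for i in range(4)]

# Layer 4 (kube-proxy/ClusterIP style): each client opens one long-lived
# HTTP/2 connection, and every request on that connection hits the same pod.
clients = 8
conn_pod = [random.choice(PODS) for _ in range(clients)]
# Clients issue very different request volumes.
requests_per_client = [random.randint(500, 2000) for _ in range(clients)]

l4_load = Counter()
for pod, n in zip(conn_pod, requests_per_client):
    l4_load[pod] += n

# Layer 7 (request-level): each request is balanced independently.
l7_load = Counter()
for n in requests_per_client:
    for _ in range(n):
        l7_load[random.choice(PODS)] += 1

def skew(load):
    """Ratio of the busiest pod's load to the per-pod mean (1.0 = even)."""
    return max(load.values()) / (sum(load.values()) / len(PODS))

print("L4 skew:", round(skew(l4_load), 2))
print("L7 skew:", round(skew(l7_load), 2))
```

Request-level balancing keeps every pod near the mean, while connection-level balancing inherits whatever imbalance the connection placement and client mix happen to produce.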
We built a custom solution to address this, and in this talk we'll walk through the architecture, the tradeoffs we considered (including why we chose not to adopt Istio or a full service mesh), and the lessons we learned rolling this out across a multi-cloud fleet.
For more technical details, see our earlier blog post: Intelligent Kubernetes Load Balancing at Databricks.
Databricks operates thousands of OLTP database instances across three clouds and hundreds of regions. When something goes wrong, engineers historically had to stitch together signals from Grafana dashboards, CLI tools, cloud provider consoles, and internal runbooks. The debugging experience was fragmented, slow, and heavily dependent on tribal knowledge. New engineers could take weeks to become effective at diagnosing database issues.
We built an AI-assisted platform to change this, starting from a hackathon prototype and growing it into a production system. In this talk, we'll share the journey from zero to production, the architectural decisions that made it work, and what we've learned about building AI-powered operational tooling at scale.
For more details, see our earlier blog post: How We Debug 1000s of Databases with AI at Databricks.
Earlier this year, we open-sourced Dicer, our auto-sharding system for building highly available, low-latency sharded services. Dicer addresses a fundamental tension in distributed systems: stateless architectures are simple but expensive (every request hits the database or remote cache), while statically sharded architectures are efficient but fragile (restarts cause availability dips, hot keys cause imbalance, and scaling requires manual intervention).
Dicer solves this by continuously and dynamically managing shard assignments. It splits overloaded shards, merges underutilized ones, replicates critical data for availability, and moves shards during rolling restarts to maintain cache hit rates. At Databricks, Dicer powers some of our most critical services: Unity Catalog achieves 90-95% cache hit rates with Dicer, our SQL query orchestration engine eliminates availability dips during restarts, and our remote cache maintains hit rates even through rolling deployments.
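Dicer's actual implementation is in the open-source repository; as a rough sketch of the split-on-hot-shard idea it describes (a hypothetical toy, not Dicer's API), a range-sharded map can watch per-shard load and split any shard that runs well above the mean:

```python
from bisect import bisect_right

# Toy range-sharded map: shards own contiguous key ranges and track load.
# Hypothetical illustration of dynamic shard splitting; not Dicer's actual API.
class ShardMap:
    def __init__(self, keyspace=1024, shards=4):
        step = keyspace // shards
        self.bounds = [i * step for i in range(1, shards)]  # split points
        self.load = [0] * shards
        self.keyspace = keyspace

    def shard_for(self, key):
        return bisect_right(self.bounds, key % self.keyspace)

    def record(self, key):
        self.load[self.shard_for(key)] += 1

    def maybe_split(self, hot_factor=2.0):
        """Split the first shard whose load exceeds hot_factor x the mean."""
        mean = sum(self.load) / len(self.load)
        for i, l in enumerate(self.load):
            if l > hot_factor * mean:
                lo = self.bounds[i - 1] if i > 0 else 0
                hi = self.bounds[i] if i < len(self.bounds) else self.keyspace
                mid = (lo + hi) // 2
                self.bounds.insert(i, mid)
                # Rough estimate: assume load divides evenly across the halves.
                self.load[i:i + 1] = [l // 2, l - l // 2]
                return i
        return None

m = ShardMap()
for key in [5] * 900 + list(range(1000)):  # hot key 5 overloads shard 0
    m.record(key)
split = m.maybe_split()
print("split shard:", split, "shard count now:", len(m.load))
```

A production system also has to merge cold shards, replicate for availability, and move shards without dropping cached state; this sketch only shows the load-triggered split.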
We are hosting a dedicated networking event during SRECon where we will go deeper on Dicer: how it works, how we use it in production, and how you can use it in your own infrastructure. This is an interactive session over drinks and appetizers, not a formal talk. Bring your questions about sharding, caching, and building stateful services at scale.
Space is limited. Register here: Databricks Networking Event @ SRECon 2026
Beyond the talks and the networking event, our infrastructure teams are tackling some of the hardest problems in multi-cloud operations. A few areas we are excited about:
Multi-cloud service delivery: Databricks runs on AWS, Azure, and GCP simultaneously. Every service, every configuration, every deployment pipeline needs to work across all three clouds and their respective government and sovereign regions. Our teams are building the tooling and abstractions that make this manageable, from unified placement configurations that define where services run, to deployment pipelines that handle the differences between cloud providers.
Service mesh and traffic routing: As our service fleet grows, routing traffic efficiently and reliably becomes increasingly complex. We are investing in service discovery, cross-cluster and cross-region routing, and integration between our load balancing and sharding systems: the problem space has expanded from optimizing traffic within a single cluster to routing across clusters, regions, and even cloud providers.
Configuration management at scale: Managing configuration across thousands of services, multiple clouds, and different environments (dev, staging, production, government regions) is a problem that compounds with every new service and every new region. Our teams are building systems to make configuration changes safe, auditable, and consistent. See our blog post on High-Availability Feature Flagging at Databricks.
Databricks is a Silver Sponsor. Find us at Booth #214 on the Expo Floor, where several engineers from our infrastructure teams will be on hand. Come find us to chat about the problems we are solving and the systems we are building.
If you miss us at SRECon and are interested in joining our team, visit our Careers site for the latest opportunities.
