Using observability data to prevent incidents

Industry Outcomes: SRE teams are excellent at responding to incidents. The data that would reduce incident frequency is sitting in logs and metrics that nobody has time to interrogate proactively.

by Madelyn Mullen

  • Engineering teams are reactive due to slow access to observability data, limiting their ability to anticipate and prevent incidents.
  • Current metrics optimize response (MTTR) but fail to surface upstream reliability risks that impact revenue, roadmap velocity, and customer trust.
  • Databricks Genie enables natural language querying of telemetry data, allowing leaders to proactively identify risks and shift from reactive firefighting to reliability intelligence.

USE CASE
Platform Reliability Intelligence & Engineering Metrics

How Engineering Teams Use Observability Data to Prevent Incidents

Engineering teams use observability data to prevent incidents by continuously monitoring signals and interrogating that data proactively to identify accumulating risk before it triggers a user-facing failure. The signals may include error rate trends, latency percentiles, deployment frequency, SLO burn rates, and others relevant to your service. The shift from reactive incident response to proactive reliability intelligence requires two things: unified access to telemetry data across services, and a way to query that data at the pace of engineering decisions. When engineering leaders can ask "which services are approaching their error budget threshold at current burn rates?" and get an answer in seconds rather than days, they can make mitigation decisions before the incident occurs. Proactive approaches protect both uptime and the R&D capacity that would otherwise be spent on emergency response.

Your engineering organization is not reactive by choice. It's reactive by architecture. You have the observability data: metrics, logs, traces, error budgets, SLO burn rates. You have the instrumentation. What you don't have is a way to ask questions of that data at the pace engineering decisions actually require. By the time the question can be answered, the incident is already in progress.

That's not an on-call problem. It's a data access problem, and it's the gap most engineering organizations haven't named yet.

Every unplanned incident has a business cost: engineering time pulled from roadmap work, customer trust eroded, SLA exposure, and support volume that spikes downstream. Reliability isn't an engineering hygiene problem. It's a revenue protection and R&D efficiency problem, and it deserves the same analytical rigor as any other business function.

As Chase Holland, Lead Principal Software Engineer at The Trade Desk, puts it: “The most expensive part of building a product isn’t writing the code anymore... It’s deciding what to do. The better data you can get on what you should be doing, the better and faster decisions you can make.” In a reliability context, that means using data to decide which risk to mitigate on Monday, so you aren't writing emergency patches on Saturday.

Modern observability platforms are optimized for incident response: alert on breach, diagnose, remediate. They are not designed to answer the upstream question a VP of Engineering actually needs answered: which parts of the system are accumulating reliability risk that will manifest as incidents in the next 30–60 days? Answering that requires interrogating error rate trends, latency percentile trends, and capacity utilization trends across dozens of services — without waiting on a data request queue. The signals exist. The engineering leader's ability to read them proactively does not.

What Is Reliability Intelligence? (And How It Differs from Observability)

Reliability intelligence is the practice of using telemetry data (metrics, logs, traces, error budgets, and deployment records) to proactively identify reliability risk before it manifests as a user-facing incident. It differs from traditional observability in one critical way: observability tells you what is happening right now; reliability intelligence tells you what is likely to happen in the next 7–30 days based on trend analysis across your service portfolio. An organization practicing reliability intelligence doesn't wait for an SLO breach alert. It identifies that a service's error budget is burning at twice the normal rate on a Tuesday morning and decides how to respond before the weekend on-call rotation feels it.
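To make "burning at twice the normal rate" concrete, here is a minimal Python sketch of a burn rate check. It assumes burn rate means the observed error rate divided by the error rate the SLO allows, a common convention that this article does not itself define. The ServiceWindow type, the service names, and the 2.0 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one observation window (hypothetical shape)."""
    name: str
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests must succeed


def burn_rate(window: ServiceWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits;
    2.0 means it is being spent twice as fast.
    """
    if not 0.0 < window.slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 0.0  # no traffic, so no budget was spent
    allowed_error_rate = 1.0 - window.slo_target
    observed_error_rate = window.failed_requests / window.total_requests
    return observed_error_rate / allowed_error_rate


def fast_burners(windows, threshold=2.0):
    """Return (name, burn rate) pairs at or above the threshold, worst first."""
    rates = [(w.name, burn_rate(w)) for w in windows]
    flagged = [(name, rate) for name, rate in rates if rate >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = [
        ServiceWindow("checkout", 100_000, 250, 0.999),  # made-up numbers
        ServiceWindow("search", 100_000, 50, 0.999),
    ]
    for name, rate in fast_burners(sample):
        print(f"{name}: burning at {rate:.1f}x the allowed rate")
```

Running a check like this on a schedule, rather than waiting for a breach alert, is the kind of trend-based review the definition above describes.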

Why Observability Data Isn’t Preventing Incidents Yet

Engineering leaders in high-scale systems track the right metrics: MTTR (mean time to resolve), incident frequency, SLO adherence by service. Those metrics tell you how well your team responds. They don't tell you what's coming. What's structurally missing is the upstream question: where is reliability risk accumulating before it becomes a page, and what is that risk costing the business in developer time, roadmap capacity, and customer confidence?

The data to answer that question exists in your telemetry. It is not in a form that engineering leaders can query without specialized tooling or analyst support. Your SRE team is excellent at responding. They are not resourced to proactively interrogate hundreds of services' worth of trend data on a weekly basis. So the signals accumulate. The incident happens. MTTR improves because your team is practiced. Incident frequency doesn't, because the analysis that would reduce it never ran. And every incident that didn't have to happen is R&D capacity that got spent on firefighting instead of shipping.

The issue of the week is, one of our product lines has growth that's slowing down and we're trying to figure out why. It's very difficult to get the insights out and know if you can trust them when you get them. — A VP of Product at an AI-Native platform

The data access problem compounds into a data trust problem. The framing holds for engineering organizations at any scale: reactive diagnosis is the default because proactive interrogation of reliability data is structurally difficult. The upstream analysis that would reduce incident frequency requires data access that engineering leaders don't have on demand. And even when you get an answer, you're not always sure it's right. MTTR improves. Incident frequency doesn't.

Without this immediate access, reliability meetings often devolve into what Holland calls "opinion-based negotiations." When teams lack a single, trusted source of truth for their operational data, they spend weeks debating the cause of a trend rather than fixing it. By shifting to a self-service model, a global advertising technology leader like The Trade Desk has turned those weeks of debate into quick, verified resolutions, allowing their teams to move with much higher intent.

How Databricks Genie Turns Observability Data Into Proactive Incident Prevention

Databricks Genie enables engineering leaders to interrogate their operational telemetry data in natural language. A VP of Engineering can ask: "Which services have shown p99 latency increases greater than 20% over the past 14 days, and what's their dependency overlap with the services that had incidents in Q2?" The answer comes from your actual engineering data in seconds, not days.
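For readers who want to see the arithmetic behind the first half of that question, the sketch below flags services whose p99 latency rose by more than 20% within a 14-day window. It is plain Python, not a description of how Genie works internally. It assumes one p99 sample per service per day, and it reads "increase over 14 days" as the average of the most recent 7 days compared with the 7 days before that, which is only one reasonable interpretation. The function names and sample data are hypothetical, and the dependency-overlap half of the question is not covered.

```python
from statistics import mean


def p99_change(daily_p99_ms, window_days=14):
    """Fractional change in p99 between the first and second half of the window.

    daily_p99_ms is a list of daily p99 latencies in milliseconds, oldest first.
    """
    if len(daily_p99_ms) < window_days:
        raise ValueError("not enough daily samples for the window")
    window = daily_p99_ms[-window_days:]
    half = window_days // 2
    baseline = mean(window[:half])
    recent = mean(window[half:])
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return (recent - baseline) / baseline


def regressed_services(p99_by_service, threshold=0.20, window_days=14):
    """Return {service: fractional change} for services above the threshold."""
    changes = {
        name: p99_change(series, window_days)
        for name, series in p99_by_service.items()
    }
    return {name: change for name, change in changes.items() if change > threshold}


if __name__ == "__main__":
    sample = {
        "checkout": [100] * 7 + [130] * 7,  # made-up numbers: a 30% rise
        "search": [200] * 14,               # flat
    }
    for name, change in regressed_services(sample).items():
        print(f"{name}: p99 up {change:.0%} over the window")
```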

The follow-on questions become natural. "Which services are approaching their error budget threshold at current burn rates, and when do we expect breach?" Or: "What's the correlation between deployment frequency and incident rate across my highest-traffic services in Q3?" This capability isn't limited to simple datasets. To maintain visibility across a massive environment of over 10,000 tables, The Trade Desk built a "Genie Router" that automatically directs questions to the right data environment. This allows them to maintain a single interface for their teams while handling a level of technical complexity that would overwhelm a standard dashboard. Each answer draws from your actual telemetry, deployment records, and incident history and becomes queryable directly by any engineering leader, without translating the question for an analyst first.

For an engineering leader whose reliability commitments are also business commitments (SLA exposure, customer trust, and the R&D capacity consumed by incidents), that interrogation speed is the difference between proactive risk management and reactive firefighting. Your error budget isn't just a technical metric; it's a business resource. Genie lets you manage it like one. The reliability signal that would have justified a mitigation sprint surfaces before the incident, not during it.

Three Steps to Move from Reactive Incident Response to Proactive Reliability Intelligence

Step 1 — Centralize your telemetry. Bring metrics, logs, traces, deployment records, and incident history into a unified data environment. Fragmented tooling is the primary reason proactive analysis doesn't happen: engineers can't answer cross-service questions when each service's data lives in a different system.

Step 2 — Define leading indicators, not just lagging ones. MTTR and incident frequency measure what already happened. Leading indicators measure what's about to happen. Teams that track SLO burn rate trajectory, p99 latency trend, and error budget remaining at current consumption rate alongside their lagging indicators can intervene before the page fires.
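As an illustration of "error budget remaining at current consumption rate," the following Python sketch projects how many days of budget are left and whether it runs out before the SLO window resets. It assumes the budget is tracked as a fraction of the window's total (1.0 means untouched) and that consumption stays constant, which real traffic rarely does. Both function names are hypothetical.

```python
def days_until_exhaustion(budget_remaining, daily_consumption):
    """Days until the error budget hits zero at the current consumption rate.

    budget_remaining: fraction of the window's budget still unspent (0.0 to 1.0).
    daily_consumption: fraction of the budget currently being spent per day.
    Returns None when nothing is being consumed.
    """
    if budget_remaining < 0:
        raise ValueError("budget_remaining cannot be negative")
    if daily_consumption <= 0:
        return None
    return budget_remaining / daily_consumption


def breaches_before_reset(budget_remaining, daily_consumption, days_left_in_window):
    """True if the budget runs out before the SLO window resets."""
    days = days_until_exhaustion(budget_remaining, daily_consumption)
    return days is not None and days < days_left_in_window


if __name__ == "__main__":
    # Made-up numbers: half the budget left, 12.5% spent per day, 10 days to reset.
    print(days_until_exhaustion(0.5, 0.125))
    print(breaches_before_reset(0.5, 0.125, 10))
```

A projection like this turns a lagging number (budget spent so far) into a leading one (time left to act).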

Step 3 — Enable self-service querying for engineering leaders. The analysis that would reduce incident frequency rarely runs because it requires analyst support and a 48-hour wait. When engineering leaders can query their own reliability data in natural language — asking "which services have the highest correlation between deployment frequency and incident rate this quarter?" — proactive risk management becomes a weekly habit, not a quarterly exercise.
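The example question in Step 3 comes down to a per-service correlation, and a rough Python sketch of that calculation follows. It assumes you already have weekly deployment counts and weekly incident counts for each service, and it uses the Pearson coefficient as the measure, which the article does not specify. The function names and data layout are hypothetical. A high coefficient shows that the two series move together, not that deployments cause the incidents.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series.

    Returns None when either series is constant, because the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        return None
    return covariance / sqrt(spread_x * spread_y)


def rank_by_correlation(weekly):
    """Rank services by deploy/incident correlation, highest first.

    weekly maps service name -> (weekly deploy counts, weekly incident counts).
    Services with an undefined correlation are left out.
    """
    scored = []
    for name, (deploys, incidents) in weekly.items():
        r = pearson(deploys, incidents)
        if r is not None:
            scored.append((name, r))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = {  # made-up numbers
        "checkout": ([1, 2, 3, 4], [0, 1, 2, 3]),
        "search": ([1, 2, 3, 4], [3, 2, 1, 0]),
        "billing": ([2, 2, 2, 2], [0, 1, 0, 1]),  # constant deploys: skipped
    }
    for name, r in rank_by_correlation(sample):
        print(f"{name}: r = {r:.2f}")
```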

How to Shift from Incident Response to Reliability Intelligence: A Practical Framework

The engineering organizations that sustain high reliability in complex, high-scale systems are the ones that can interrogate their operational data proactively and find the signals of accumulating risk before they manifest as user-facing incidents. That requires data access designed for the pace of engineering decision-making, not for the pace of analyst query queues.

The insight-to-action cycle in platform reliability is measured in hours and days, not sprint cycles. When an engineering team can identify a p99 latency trend on Monday morning and decide on a mitigation approach before standup, they're operating on reliability intelligence rather than incident response. When that same question requires a data request and a 48-hour wait, the incident happens first.

For engineering teams with customer-facing SLAs, that speed has direct business consequences. Ad-hoc analysis runs 5x faster with Genie, meaning the reliability question that would have waited two days for an analyst runs before standup.

DATABRICKS GENIE · KEY DIFFERENTIATORS
Built for your data, governed by your rules, answerable to any engineering leader.

  • Telemetry data integration: Metrics, logs, and traces from your observability platform alongside deployment and incident records in a unified environment.
  • Service dependency awareness: Genie understands your service graph — questions about dependency risk are answerable across your actual architecture.
  • DORA metrics: Deployment frequency, lead time, MTTR, and change failure rate are queryable conversationally — no dashboard required for engineering performance discussion.
  • Capacity and cost integration: Infrastructure cost data alongside performance data — reliability and efficiency decisions get integrated context.
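For reference, the four DORA metrics named in the list above can be derived from deployment records with very little arithmetic. The Python sketch below shows one way to do it. It assumes each record carries a commit time, a deploy time, a failure flag, and a restore time for failed deployments. The Deployment type and its field names are hypothetical and do not reflect any Databricks schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    """One deployment record (hypothetical shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments, period_days):
    """Compute the four DORA metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    count = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [_hours(d.committed_at, d.deployed_at) for d in deployments]
    restore_times = [
        _hours(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": count / period_days,
        "mean_lead_time_hours": sum(lead_times) / count,
        "change_failure_rate": len(failures) / count,
        # None when no failed deployment has been restored yet
        "mttr_hours": sum(restore_times) / len(restore_times) if restore_times else None,
    }


if __name__ == "__main__":
    day = datetime(2024, 1, 1)  # made-up records
    records = [
        Deployment(day.replace(hour=8), day.replace(hour=9)),
        Deployment(day.replace(hour=9), day.replace(hour=11)),
        Deployment(day.replace(hour=10), day.replace(hour=13),
                   failed=True, restored_at=day.replace(hour=15)),
        Deployment(day.replace(hour=14), day.replace(hour=16)),
    ]
    print(dora_summary(records, period_days=2))
```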

Frequently Asked Questions

Q: What is the difference between observability and monitoring for incident prevention?
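A: Monitoring is reactive: it alerts you when something has already broken. Observability means understanding system state well enough to anticipate failures before they reach users. For incident prevention, that is the difference between responding to a breach and acting on the trend that precedes it.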

Q: Which observability signals are most predictive of upcoming incidents?
A: SLO burn rate, p99 latency trend, and deployment-correlated error rate are the three most actionable leading indicators. Each shows risk accumulating before it becomes a page, unlike lagging measures such as MTTR and incident frequency.

Q: How does Databricks Genie help SRE teams prevent incidents? 
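A: Genie lets teams query operational telemetry in natural language, so proactive trend interrogation no longer depends on analyst support. A question such as "which services are approaching their error budget threshold at current burn rates?" is answered from your telemetry, deployment records, and incident history in seconds rather than days.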

Q: How long does it take to shift from reactive incident response to proactive reliability intelligence?
A: Centralizing telemetry and setting up tooling typically takes weeks. The cultural shift to proactive querying takes longer, usually one to three months, and depends on engineering leaders having the right self-service access.

Q: What DORA metrics should engineering teams track to improve reliability? 
A: Track the four DORA metrics: deployment frequency, lead time, MTTR, and change failure rate. Change failure rate and MTTR together are the strongest reliability predictors.

See What Genie Can Do for Your Team

If your MTTR keeps improving but incident frequency doesn't, the gap isn't execution — it's proactive data access. Reliability is an R&D efficiency problem. See how engineering leaders are using AI/BI Genie to manage it like one.
