Session

Databricks on Databricks: Automating Service Reliability at Scale

Overview

Experience: In Person
Track: Data Engineering & Streaming
Industry: Enterprise Technology
Technologies: AI/BI, Databricks SQL, Lakeflow
Skill Level: Intermediate
Rapid product delivery makes service reliability a core engineering challenge. This talk shows how we use Databricks on Databricks to build production systems that automate reliability detection and investigation for user-facing services.

I’ll cover two systems. KPI Miss Auto-Investigator automates KPI regression analysis by detecting dips, ranking error contributors, computing customer blast radius, and syncing annotations across team and global dashboards. Built with Databricks Jobs and ETL pipelines with inter-job dependencies, it cuts on-call investigation time by 67%.

User Error Anomaly Detection is a Databricks-based system for logging, detection, and monitoring. User error logs are ingested via Zerobus into Delta tables, analyzed by Jobs using Asset Bundles, and monitored through SQL Alerts and AI/BI dashboards. The system detects incidents up to 3 hours before customer reports, reducing discovery time by 95% and customer impact by 90%, with 95% recall and 98% precision.
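As a rough illustration of the three analysis steps named for the KPI Miss Auto-Investigator (detecting dips, ranking error contributors, computing customer blast radius), here is a minimal plain-Python sketch. It is not the system presented in the talk: the trailing-window z-score rule, the function names, and the `error_class` and `customer_id` fields are all assumptions made for illustration, and a production version would presumably run as Databricks Jobs over Delta tables rather than over in-memory lists.

```python
from collections import Counter
from statistics import mean, stdev


def detect_dips(series, window=7, z_threshold=3.0):
    """Return indices whose value sits more than z_threshold standard
    deviations below the mean of the preceding `window` points.
    (Assumed detection rule; the talk does not specify one.)"""
    dips = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (mu - series[i]) / sigma > z_threshold:
            dips.append(i)
    return dips


def rank_error_contributors(errors):
    """Count errors per class, largest first (ties broken by name).
    `error_class` is a hypothetical field name."""
    counts = Counter(e["error_class"] for e in errors)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def blast_radius(errors):
    """Number of distinct customers that hit at least one error.
    `customer_id` is a hypothetical field name."""
    return len({e["customer_id"] for e in errors})


# Hypothetical daily success-rate KPI with a dip on the last day.
kpi = [0.990, 0.991, 0.989, 0.990, 0.992, 0.988, 0.990, 0.900]

# Hypothetical error records from the dip window.
errors = [
    {"error_class": "TIMEOUT", "customer_id": "c1"},
    {"error_class": "TIMEOUT", "customer_id": "c2"},
    {"error_class": "AUTH", "customer_id": "c1"},
]

print(detect_dips(kpi))
print(rank_error_contributors(errors))
print(blast_radius(errors))
```

The three functions are kept separate to mirror the staged pipeline the abstract describes, where each step could be its own job with a dependency on the previous one.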

Session Speakers

Hongwen (Olivia) Song

Data Scientist
Databricks