Session
Databricks on Databricks: Automating Service Reliability at Scale
Overview
| Experience | In Person |
|---|---|
| Track | Data Engineering & Streaming |
| Industry | Enterprise Technology |
| Technologies | AI/BI, Databricks SQL, Lakeflow |
| Skill Level | Intermediate |
Rapid product delivery makes service reliability a core engineering challenge. This talk shows how we use Databricks on Databricks to build production systems that automate the detection and investigation of reliability issues in user-facing services.

I’ll cover two systems. KPI Miss Auto-Investigator automates KPI regression analysis by detecting dips, ranking error contributors, computing customer blast radius, and syncing annotations across team and global dashboards. Built with Databricks Jobs and ETL pipelines with inter-job dependencies, it cuts on-call investigation time by 67%.

User Error Anomaly Detection is a Databricks-based system for logging, detection, and monitoring. User error logs are ingested via Zerobus into Delta tables, analyzed by Jobs deployed with Asset Bundles, and monitored through SQL Alerts and AI/BI dashboards. The system detects incidents up to 3 hours before customers report them, reducing discovery time by 95% and customer impact by 90%, with 95% recall and 98% precision.
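
To make the KPI Miss Auto-Investigator steps more concrete, here is a minimal sketch of dip detection, error-contributor ranking, and customer blast radius in plain Python. It is an illustration under stated assumptions, not the system described in the talk: the trailing-window z-score rule, the function names `detect_dips` and `rank_contributors`, the `window` and `z_threshold` parameters, and the `(error_code, customer_id)` record shape are all hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def detect_dips(kpi, window=7, z_threshold=3.0):
    """Return indices where the KPI drops more than z_threshold standard
    deviations below the mean of the preceding `window` points.

    Assumption: a simple trailing z-score rule stands in for whatever
    detection logic the real system uses.
    """
    dips = []
    for i in range(window, len(kpi)):
        history = kpi[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat windows to avoid dividing by zero.
        if sigma > 0 and (mu - kpi[i]) / sigma > z_threshold:
            dips.append(i)
    return dips


def rank_contributors(errors):
    """Rank error codes by occurrence count and report blast radius.

    `errors` is a list of (error_code, customer_id) pairs (hypothetical shape).
    Returns (error_code, occurrences, distinct_customers) tuples, most
    frequent error first.
    """
    counts = Counter(code for code, _ in errors)
    customers = {}
    for code, customer in errors:
        customers.setdefault(code, set()).add(customer)
    return [(code, n, len(customers[code])) for code, n in counts.most_common()]


# Hypothetical daily success-rate series with a drop on the last day.
kpi = [0.990, 0.991, 0.989, 0.990, 0.992, 0.990, 0.991, 0.950]
errors = [("TIMEOUT", "a"), ("TIMEOUT", "b"), ("TIMEOUT", "a"), ("AUTH", "c")]

dip_days = detect_dips(kpi)
ranking = rank_contributors(errors)
```

In a production setting, logic like this would more plausibly run as SQL or Spark over Delta tables inside scheduled Jobs; the plain-Python form is used here only to keep the sketch self-contained.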
Session Speakers
Hongwen (Olivia) Song
Data Scientist
Databricks