Session
AI_DIAGNOSE: Automating Spark Job Debugging with LLM Agents
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Enterprise Technology |
| Technologies | Databricks SQL |
| Skill Level | Intermediate |
Debugging distributed Spark jobs is notoriously difficult, often requiring hours of sifting through massive logs to find root causes. We introduce AI_DIAGNOSE, a new Spark SQL procedure in Databricks Runtime that automates this process using Large Language Models (LLMs).

This session explores the architecture of AI_DIAGNOSE, detailing how we implemented a ReAct-style agent within the Spark engine. We will demonstrate how the agent autonomously gathers context, retrieves distributed logs, and analyzes job failures. We'll cover technical challenges such as managing LLM context windows with large logs, ensuring structured, actionable outputs, and leveraging Databricks Model Serving for secure inference. Join us to learn how AI_DIAGNOSE works under the hood to drastically reduce Mean Time To Resolution (MTTR) for data pipelines.
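To make the ReAct pattern mentioned above concrete, here is a minimal sketch of a reason/act/observe loop in Python. It is not the AI_DIAGNOSE implementation, which this abstract does not show. Every name in it (`get_job_context`, `fetch_logs`, `scripted_llm`, `Diagnosis`, the fake log data, and the 400-character budget) is a hypothetical placeholder chosen for illustration. The model is a scripted stub, so the example runs without any LLM endpoint.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in data; a real agent would query the cluster instead.
FAKE_LOGS = (
    ["INFO task started"] * 500
    + ["ERROR Executor lost: java.lang.OutOfMemoryError: Java heap space"]
    + ["INFO retrying stage"] * 500
)


def get_job_context(job_id: str) -> str:
    """Hypothetical tool: return basic metadata about the failed job."""
    return json.dumps({"job_id": job_id, "state": "FAILED", "stage": 3})


def fetch_logs(job_id: str, max_chars: int = 400) -> str:
    """Hypothetical tool: return log lines trimmed to a context budget.

    Error lines are kept first; the most recent other lines fill what is left.
    """
    errors = [line for line in FAKE_LOGS if line.startswith("ERROR")]
    tail = [line for line in reversed(FAKE_LOGS) if not line.startswith("ERROR")]
    kept, used = [], 0
    for line in errors + tail:
        if used + len(line) + 1 > max_chars:
            break
        kept.append(line)
        used += len(line) + 1
    return "\n".join(kept)


TOOLS: Dict[str, Callable[[str], str]] = {
    "get_job_context": get_job_context,
    "fetch_logs": fetch_logs,
}


@dataclass
class Diagnosis:
    """Structured result, so callers never have to parse free text."""
    root_cause: str
    recommendation: str


def scripted_llm(transcript: List[str]) -> str:
    """Stub model that replays a fixed plan; swap in a real LLM call here."""
    observations = [t for t in transcript if t.startswith("Observation:")]
    if len(observations) == 0:
        return json.dumps({"thought": "Need job metadata.", "action": "get_job_context"})
    if len(observations) == 1:
        return json.dumps({"thought": "Need the logs.", "action": "fetch_logs"})
    found_oom = "OutOfMemoryError" in observations[-1]
    final = {
        "root_cause": "executor out of memory" if found_oom else "unknown",
        "recommendation": (
            "increase executor memory or reduce partition size"
            if found_oom
            else "inspect logs manually"
        ),
    }
    return json.dumps({"thought": "Enough evidence.", "final": final})


def diagnose(job_id: str, llm=scripted_llm, max_steps: int = 5) -> Diagnosis:
    """Loop: the model reasons and picks an action, we run it and record the result."""
    transcript = [f"Task: diagnose job {job_id}"]
    for _ in range(max_steps):
        reply = json.loads(llm(transcript))
        transcript.append(f"Thought: {reply.get('thought', '')}")
        if "final" in reply:
            return Diagnosis(**reply["final"])
        tool = TOOLS[reply["action"]]
        transcript.append(f"Action: {reply['action']}")
        transcript.append(f"Observation: {tool(job_id)}")
    # Bound the loop so a confused model cannot run forever.
    return Diagnosis("undetermined", "step limit reached")


result = diagnose("job-123")
print(result)
```

The sketch touches two of the challenges listed in the abstract in simplified form: `fetch_logs` trims a large log to a fixed budget before it reaches the model, and `Diagnosis` forces the final answer into a fixed schema. A production system would likely need a smarter log selection strategy than "errors first, then most recent lines"; that choice is an assumption made here for brevity.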
Session Speakers
Allison Wang
Staff Software Engineer
Databricks
Shujing Yang
Software Engineer
Databricks