Session

AI_DIAGNOSE: Automating Spark Job Debugging With LLM Agents

Overview

Experience	In Person
Track	Artificial Intelligence & Agents
Industry	Enterprise Technology
Technologies	Databricks SQL
Skill Level	Intermediate

Debugging distributed Spark jobs is notoriously difficult, often requiring hours of sifting through massive logs to find root causes. We introduce AI_DIAGNOSE, a new Spark SQL procedure in Databricks Runtime that automates this process using large language models (LLMs).

This session explores the architecture of AI_DIAGNOSE, detailing how we implemented a ReAct-style agent within the Spark engine. We will demonstrate how the agent autonomously gathers context, retrieves distributed logs, and analyzes job failures. We'll cover technical challenges such as fitting large logs into LLM context windows, producing structured, actionable outputs, and using Databricks Model Serving for secure inference. Join us to learn how AI_DIAGNOSE works under the hood to reduce mean time to resolution (MTTR) for data pipelines.
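
For readers unfamiliar with the ReAct pattern mentioned above, the following is a minimal Python sketch of the general idea: the model alternates between requesting a tool call and reading the resulting observation until it emits a final answer. It is not AI_DIAGNOSE's implementation. Every name here (`run_agent`, `fetch_logs`, `tail_within_budget`, the `Action:`/`Final:` text protocol, the character budget) is a hypothetical stand-in, the LLM is replaced by a scripted stub, and log truncation by character count is only a rough proxy for managing a token-based context window.

```python
from typing import Callable, Dict, List

MAX_LOG_CHARS = 200  # hypothetical budget; a real system would count tokens


def tail_within_budget(log_lines: List[str], budget: int = MAX_LOG_CHARS) -> str:
    """Keep the most recent whole lines that fit in the budget.

    Assumes the failure is reported near the end of the log.
    """
    kept: List[str] = []
    used = 0
    for line in reversed(log_lines):
        cost = len(line) + 1  # +1 accounts for the newline separator
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(reversed(kept))


def run_agent(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Loop: ask the model, run the tool it names, feed back the observation."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        transcript.append(reply)
        if reply.startswith("Final:"):
            return reply[len("Final:"):].strip()
        if reply.startswith("Action:"):
            name, _, arg = reply[len("Action:"):].strip().partition(" ")
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool {name}"
            transcript.append(f"Observation: {observation}")
    return "No diagnosis within step limit"


# Fabricated demo data: many routine lines followed by one error line.
FAKE_LOGS = {
    "job-1": ["INFO stage 0 started"] * 50
    + ["ERROR java.lang.OutOfMemoryError: Java heap space"]
}


def fetch_logs(job_id: str) -> str:
    """Hypothetical tool: return the truncated log tail for a job."""
    return tail_within_budget(FAKE_LOGS.get(job_id, []))


def scripted_llm(prompt: str) -> str:
    """Stub standing in for a model endpoint; follows a fixed script."""
    if "Observation:" not in prompt:
        return "Action: fetch_logs job-1"
    if "OutOfMemoryError" in prompt:
        return "Final: executor ran out of heap memory"
    return "Final: cause unknown"


diagnosis = run_agent(scripted_llm, {"fetch_logs": fetch_logs}, "Why did job-1 fail?")
print(diagnosis)
```

Keeping only the log tail is the simplest possible strategy and would miss errors that occur early in a long log; it is shown here only to make the context-window constraint concrete.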

Session Speakers

Allison Wang

Staff Software Engineer
Databricks

Shujing Yang

Software Engineer
Databricks