Stop Guessing, Start Scoring: LLM Extraction Evaluation on Databricks
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Consulting & Services |
| Technologies | Databricks Apps, Agent Bricks, Lakebase |
| Skill Level | Intermediate |
Speakers: Michelle JanneyCoyle (Databricks), Darshana Nair (CLA)

Robust evaluation is key to building reliable LLM-based systems. Together with CliftonLarsonAllen (CLA), a leading accounting and professional services firm, we developed a Databricks-native evaluation solution that centralizes metrics, ground truth, and expert feedback to enable repeatable offline evaluation of their SOC extraction workflow.

In this talk, we deconstruct our implementation, highlighting how we use managed MLflow, custom scorers, and a Databricks App backed by Lakebase to capture SME feedback and structured ground truth. You will learn how this framework makes quality measurable and provides a clear signal for future improvements.

Key Takeaways
- Automating evaluation using Databricks tools.
- Using Databricks Apps to bridge the gap between data science teams and domain experts.
- Lessons from evaluating complex semantic extraction and structured JSON outputs (see the sketch below).
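The abstract does not include code, but a custom scorer for structured JSON outputs of the kind it describes might look like the sketch below: a field-level exact-match score plus a separate parse-success signal. The function name, the SOC-style fields (`control_id`, `status`), and the scoring logic are illustrative assumptions, not CLA's actual implementation; in managed MLflow, a plain Python function with this shape can be registered as a custom scorer for repeatable offline evaluation.

```python
import json

def json_field_accuracy(prediction: str, ground_truth: dict) -> dict:
    """Score an LLM's structured JSON extraction against expert ground truth.

    Illustrative sketch only: the fields and metric design are assumptions,
    not the scorer described in the session.
    """
    try:
        extracted = json.loads(prediction)
    except json.JSONDecodeError:
        # Malformed JSON scores zero, but parse failures are reported
        # separately so they can be distinguished from wrong values.
        return {"parse_ok": 0.0, "field_accuracy": 0.0}

    fields = ground_truth.keys()
    correct = sum(1 for f in fields if extracted.get(f) == ground_truth[f])
    return {
        "parse_ok": 1.0,
        "field_accuracy": correct / len(fields) if fields else 1.0,
    }

# Example: scoring one extraction from a hypothetical SOC report record.
pred = '{"control_id": "CC6.1", "status": "operating effectively"}'
truth = {"control_id": "CC6.1", "status": "operating effectively"}
print(json_field_accuracy(pred, truth))  # {'parse_ok': 1.0, 'field_accuracy': 1.0}
```

Returning a dict of named scores rather than a single number keeps failure modes separable, so a drop in quality can be traced to malformed output versus incorrect field values.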
Session Speakers
Darshana Nair
Data Scientist Director
CLA
Michelle JanneyCoyle
AI Forward Deployed Engineer
Databricks