Session

How to Evaluate Your Agents: From First Benchmark to CI/CD

Overview

Experience: In Person
Track: Artificial Intelligence & Agents
Industry: Enterprise Technology, Communications, Media & Entertainment
Technologies: AI/BI, Agent Bricks, Lakebase
Skill Level: Intermediate
If your team is building agents, evaluation is the difference between a demo that works and a product you can actually rely on. But most teams don't know where to start — which metrics matter, how to build a benchmark, how to automate tests, or how to catch regressions before they reach users.

In this how-to session, we walk through agent evaluation end-to-end: defining your first benchmark, choosing the right judges and metrics for your use case, automating evaluation across prompts and workflows, and building a CI/CD loop that catches regressions before they hit production. We'll share practical patterns from real production teams and show how MLflow ties it all together.

You'll leave with a concrete playbook for evaluating agents from first test to continuous quality at scale — and the confidence to ship agents that keep getting better with every iteration.
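To make the workflow concrete, the sketch below shows one way the benchmark-to-CI-gate loop described above could look with MLflow's evaluation API. It is illustrative, not session material: the stand-in agent, the benchmark rows, the threshold, and the metric key checked at the end are all assumptions, and the exact metric names and callable-model support depend on your MLflow version.

```python
import mlflow
import pandas as pd

# A tiny benchmark: representative inputs paired with reference answers.
# In practice this grows into a curated, versioned evaluation dataset.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "How do I reset my workspace password?",
            "Summarize last quarter's support ticket trend.",
        ],
        "ground_truth": [
            "Use the admin console's password-reset flow.",
            "Ticket volume fell quarter over quarter.",
        ],
    }
)


def my_agent(question: str) -> str:
    """Placeholder for your real agent call; replace with your own code."""
    return "Use the admin console's password-reset flow."


def answer_with_agent(inputs: pd.DataFrame) -> list[str]:
    """Wrapper MLflow calls with the benchmark inputs as a DataFrame."""
    return [my_agent(q) for q in inputs["inputs"]]


with mlflow.start_run(run_name="agent-eval"):
    results = mlflow.evaluate(
        model=answer_with_agent,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",  # selects a default metric set
    )

# CI gate: fail the pipeline if quality drops below a chosen floor.
# The metric key below is an assumption; inspect results.metrics for the
# exact names produced by your MLflow version and model type.
THRESHOLD = 0.8
score = results.metrics.get("exact_match/v1", 0.0)
assert score >= THRESHOLD, f"Agent regression: {score:.2f} < {THRESHOLD}"
```

Run inside a CI job (for example, a pytest test or a scheduled workflow), the failing assertion is what turns the benchmark into a regression gate; swapping in LLM judges or custom scorers changes the metrics, not the shape of the loop.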

Session Speakers

Max Fisher

Lead Solutions Architect
Databricks


Wenwen Xie

Specialist Solutions Architect
Databricks