Session
How to Evaluate Your Agents: From First Benchmark to CI/CD
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Enterprise Technology, Communications, Media & Entertainment |
| Technologies | AI/BI, Agent Bricks, Lakebase |
| Skill Level | Intermediate |
If your team is building agents, evaluation is the difference between a demo that works and a product you can actually rely on. But most teams don't know where to start — which metrics matter, how to build a benchmark, how to automate tests, or how to catch regressions before they reach users.

In this how-to session, we walk through agent evaluation end-to-end: defining your first benchmark, choosing the right judges and metrics for your use case, automating evaluation across prompts and workflows, and building a CI/CD loop that catches regressions before they hit production. We'll share practical patterns from real production teams and show how MLflow ties it all together.

You'll leave with a concrete playbook for evaluating agents from first test to continuous quality at scale — and the confidence to ship agents that keep getting better with every iteration.
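To make the CI/CD idea concrete, here is a minimal sketch of a regression gate: score the current agent against a small benchmark and block the release if quality drops below the last shipped baseline. Everything here (`run_agent`, `BENCHMARK`, `BASELINE_SCORE`, the keyword-match judge) is a hypothetical stand-in, not the session's or MLflow's actual API; in practice you would plug in your real agent and a proper judge.

```python
# Hypothetical sketch of a CI/CD regression gate for agent evaluation.
# Names (run_agent, BENCHMARK, BASELINE_SCORE) are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_keyword: str  # simplest possible "judge": expected substring


# A tiny first benchmark: a handful of prompts with checkable expectations.
BENCHMARK = [
    Case("What is 2 + 2?", "4"),
    Case("Name the capital of France.", "Paris"),
]


def run_agent(prompt: str) -> str:
    # Stand-in for a real agent call (e.g. an LLM endpoint).
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "Name the capital of France.": "Paris is the capital.",
    }
    return canned.get(prompt, "")


def score(benchmark: list[Case]) -> float:
    """Fraction of benchmark cases the agent answers acceptably."""
    hits = sum(c.expected_keyword in run_agent(c.prompt) for c in benchmark)
    return hits / len(benchmark)


# Baseline: the score of the last released agent version.
BASELINE_SCORE = 0.9


def regression_gate() -> bool:
    """Return True if the current agent meets or beats the baseline."""
    return score(BENCHMARK) >= BASELINE_SCORE


if __name__ == "__main__":
    assert regression_gate(), "quality regression: blocking release"
    print("gate passed")
```

Run as a test step in your pipeline, a failing gate stops the deploy; the same pattern scales up by swapping the substring judge for an LLM judge and logging each run's scores for comparison across versions.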
Session Speakers
Max Fisher
Lead Solutions Architect
Databricks
Wenwen Xie
Specialist Solutions Architect
Databricks