
Closing the Feedback Loop: Self-Evolving Agent Test Harness with MLflow

Overview

Experience: In Person
Track: Artificial Intelligence & Agents
Industry: Enterprise Technology
Technologies: Agent Bricks
Skill Level: Intermediate
"Most teams iterate on agent quality through vibe-checking: try a question, read the answer, fix what looks wrong, repeat. It works initially, but breaks down as the project grows. Each fix can revive an old bug or introduce a new class of issue. Coding agents add to the strain. They rewrite the agent in seconds while verification still happens by hand.Offline evaluation is one answer, but it is hard. Traditional eval requires setting up golden datasets, metrics, and judges b efore it pays off, and many teams don't get past the setup cost.This session introduces a new approach in MLflow: a self-evolving test harness for agent quality that builds itself from inside the vibe-checking loop. Each piece of feedback on a bad answer becomes an automated test, and every coding-agent fix runs against the accumulated suite.We'll walk through a live demo and share what we've learned."

Session Speakers


Yuki Watanabe

Senior Software Engineer
Databricks


Daniel Lok

Software Engineer
Databricks