You'd never let a coding assistant refactor your codebase without a test suite. Without tests, the assistant flies blind. It might fix one function and silently break three others. The tests are what close the loop: run them, observe failures, fix the code, run them again. No tests, no confidence.
At Databricks we continuously develop and deploy agents that cover a wide range of functionality, from new features in the Databricks platform (e.g., the data-engineering, trace analysis, and machine learning capabilities in Genie Code), to OSS projects (e.g., the MLflow assistant), to internal engineering workflows (e.g., on-call support or automated code reviewers). These agents can perform long-running tasks, generate thousands of lines of code, and create new data and AI assets among other things. While we had some basic checks in place early on, we lacked the kind of comprehensive, automated test suite that would let us iterate with confidence. This post describes how we closed that gap using MLflow, and the best-practices coSTAR (coupled Scenario, Trace, Assess, Refine) methodology we built around it. coSTAR runs two coupled loops: one that aligns judges with human expert judgment so they can be trusted, and one that uses those trusted judges to automatically refine the agent until it passes all test scenarios.

Figure: The coSTAR framework runs two mirrored STAR loops (Scenario → Trace → Assess → Refine) . The agent loop (blue) uses judges to auto-score traces and refines the agent to align with judges. The judge loop (orange) uses human experts to score traces and refines the judges to align with their assessments. Both loops share the same scenarios and traces.
Early on, our development loop looked like this: run the agent, manually review its output, spot a flaw, tell a coding assistant to fix it. Repeat.
If this reminds you of writing code without tests and manually QA-ing every change, that's exactly what it was. And it failed in exactly the way you'd predict. The obvious reaction is "so write tests." But agent testing is structurally different from testing a deterministic function, and several challenges compound at once:
These constraints shaped every design decision that follows. They're also what makes this problem interesting: we're not just building a test runner, we're building a automated optimization methodology for stochastic, long-running, multi-step processes where "correct" is a judgment call.
If you squint, agent development maps cleanly onto the dev loop that every engineer already knows:
| Traditional software | Agent development |
|---|---|
| Source code | Agent implementation (including prompts, choices of FMs, tools) |
| Test suite | LLM judges |
| Test fixtures (setup, input, expected output) | Scenario definitions (initial state, prompt, expectations) |
| Test runner / harness | Test harness executes the agent under test, produces traces |
| Test correctness (do tests check the right thing?) | Judge alignment (does the judge agree with human experts?) |
| Coding assistant fixes code until tests pass | Coding assistant refines implementation until judges pass |
| CI runs all tests on every change | CI runs scenarios + judges on every change |
| Production monitoring | Same judges run on live traffic |
This analogy isn't just illustrative. It's the literal architecture of our system, which we call coSTAR: two coupled loops that use Scenario definitions as test fixtures, Trace capture as the test harness, Assess with judges as the test suite, and Refine as the red-green loop. Let's walk through each piece.
In traditional testing, a test fixture sets up the preconditions: create a database, seed it with data, configure the environment. Our equivalent is a scenario definition: a structured description of the initial state, the user prompt, and the expected outcomes.
Here's a simplified scenario for testing a Data Analyst agent against a messy dataset:
Each scenario bundles the setup, the input, and the success criteria in one place, just like a test fixture. We maintain a suite of these across different agents, covering common cases, edge cases, and known past failures. The suite grows over time as we discover new failure modes: every bug we find in production becomes a new scenario, the same way every production bug should become a regression test.
Why bother with this structure? Because agent runs are expensive. A single scenario takes minutes to execute. We need to be deliberate about what we test, and we need the scenario definitions to be portable: the same scenario can run against different agent implementations or different versions of the same agent.
To run our test suite, we use a harness that sends each scenario's prompt to the agent under test (AUT). Each execution is captured as a MLflow trace: a structured log of every tool call, every intermediate output, and every artifact the agent produces. Think of it as a flight recorder: it captures everything the agent did, in order, so we can inspect any part of the execution after the fact.
A key architectural decision: we decouple execution from scoring. The test harness produces traces; the judges (which we'll introduce next) score them. These are separate steps. By persisting traces, we can iterate on judges without re-running scenarios. Adjust a threshold? Re-score the recorded traces in seconds. Add a new judge? Run it against every trace you've ever collected. Suspect a judge is wrong? Compare its verdicts against the recordings and debug it offline. One expensive agent run produces data that gets reused many times, including as candidates for the Golden Set we'll use to align judges later.
Judges operate on traces and reason about properties of the execution: did the agent produce valid code? Did the output meet a quality threshold? Did the agent follow the right process? As mentioned earlier, this evaluation is different from traditional unit tests: agent output is non-deterministic and rich, and so asserting exact outputs is essentially useless.
The standard approach to implementing these judges is "LLM-as-a-Judge": feed the full trace to a model and ask for a score and equally importantly a rationale for that score. However, that's like writing a test that dumps the entire program state into an assertion. It's expensive, fragile, and hard to debug. For our agents, a single trace can be thousands of lines long. Stuffing it into a judge's context window degrades judgment quality.
Instead, we use MLflow’s agentic judges: judges that are themselves agents, equipped with tools to explore the trace selectively. Just like a well-written test calls a specific function and checks a specific return value, an agentic judge calls a specific tool on the trace and checks a specific property.
Here are some example judges that we have used across our agents:
Skill invocation judge explores the trace and identifies whether the agent invoked skills that are targeted by the scenario (if not, then the skill’s purpose is not clear to the AUT):
Best-practices judge explores whether the output follows best practices according to Databricks official documentation:
Outcome Judge inspects the trace for output assets and asserts certain properties. Going back to the Data Analyst example, identify the part of the trace where engineering code was authored and evaluate whether the code is appropriate for the task at hand:
This judge is interesting because it tackles the subjective quality problem head-on: what counts as good feature engineering depends on domain expertise. An LLM judge can't get this right out of the box. It's tempting to try writing out the complete criteria in the judge's prompt: "prefer median imputation over mean for skewed distributions, always scale features before distance-based models, ..." But encoding a domain expert's full judgment into a prompt is laborious and brittle. It's much easier for humans to look at an example and say "this is good" or "this is bad" than to write out the complete spec. This is exactly why alignment works, as we'll cover shortly.
In general, our test suite for a single agent includes judges across several categories:
Deterministic checks, things we can verify mechanically, no LLM needed:
LLM-based checks, judgment calls that require understanding context:
Operational metrics, signals that don't pass/fail individually but track health over time:
The operational metrics deserve a note. They don't gate a release the way pass/fail judges do, but they're critical for cost management and early warning. If token usage doubles after a change, something went wrong even if all judges still pass; the agent is probably doing more work than it should. We track these over time and alert on anomalies.
Test suites don't get authored in one sitting. They evolve over time. They start with the simplest checks that give a signal: does the output exist? Does it parse? Then structural checks follow: does the output have the right schema, the right columns, the right types? Only later come end-to-end data validation judges: does the output actually produce correct results when you run it?
This mirrors how test suites mature in traditional software. Exhaustive integration tests don't come on day one. It starts with smoke tests, then unit tests as failure modes emerge, building toward end-to-end coverage over time. The key is that the infrastructure supports adding new judges cheaply, so the test suite grows alongside the agent.
Here's a problem every engineer knows: a flaky or wrong test suite that greenlights bad code ships bugs with confidence. Similarly, judges who approve poor outcomes give a false sense of security. This is where the second loop of the coSTAR framework comes in: the same scenarios and traces that drive agent refinement also drive judge refinement, with human expert scores as the ground truth. This matters because, unlike traditional testing where test correctness can be verified by inspection, LLM judges are stochastic and can drift in how they interpret natural-language criteria. So we need a way to verify them and keep them aligned with human experts.
To do this alignment, we first curate a Golden Set of typically dozens of examples of agent outputs that our engineers have manually assessed. This is the ground truth the judges must agree with. Then we leverage MLflow's alignment capabilities (powered by techniques like GEPA and MemAlign) to automatically refine the judge against the Golden Set. Notice this is structurally the same STAR loop we use to refine the AUT itself, but the assess step is performed by human experts and the refine step applies to the judge.
With judges that the judge loop has aligned against human expert judgment, we can now trust the agent loop. A coding assistant treats the agent as its codebase and the judges as its test suite. It reads failures, diagnoses root causes, patches the agent, and re-runs everything. The engineer is still the reviewer and final arbiter of the proposed changes to the agent, but this automated iteration saves considerable human effort in analyzing and improving the agent.
Here's what one iteration looked like for the Data Analyst agent:
Red. We ran the initial version of the agent against our scenario suite. The best-practices judge flagged a discrepancy: our agent was generating code for logical views that was different from our official recommendations/documentation. While this discrepancy would not affect correctness, it had implications on the maintenance and deployment of the generated code. This is an example of an insidious regression that would be hard to catch by manual investigation.
Green. The coding assistant analyzed the judge feedback and identified the gap: the agent was using a skill that was not prescriptive about the type of views that should be created (temporary vs permanent). After adding the relevant guidance to the skill, the tests passed successfully and the change was verified to not introduce other regression (based on other test scenarios).
So far we've described judges as tests for the agent, catching regressions when the agent implementation changes. But in practice, the agent itself isn't the only thing that changes. The agent depends on external tools and infrastructure, and those change too.
Our agents call MCP tools, standardized interfaces for data access, code execution, environment setup, and more. These tools have their own development teams and release cycles. When a tool changes its implementation (say, a code execution tool starts returning stderr in a different format, or a data access tool changes how it handles null values) the agent hasn't changed at all, but the agent's behavior can break.
Because we run our judges on every nightly build, they act as regression tests against the full stack, not just the agent’s current implementation. When a tool team ships a change that causes an agent to start failing its judges then we catch the error immediately, before it reaches customers. More importantly, the judge's failure tells us what broke (the specific quality dimension that regressed), which makes it far easier to triage whether the root cause is in the agent or in a tool the agent depends on.
This is the same value that integration tests provide in traditional software: they guard the contract between the code and its dependencies. The only difference is that here, the "code" is an agent and the "dependencies" are MCP tools.
There's one more extension of the testing analogy that turned out to be surprisingly valuable: running the same judges on production traffic.
In traditional software, testing doesn't stop at CI. Production gets monitored too: error rates, latency percentiles, business metrics on live traffic. The same test logic that validates code in dev often reappears as health checks and alerts in prod.
We do the same thing. The judges we built for eval are designed to score any agent conversation, not just eval scenarios. So we run them (or a sampled subset) on real production conversations. This gives us:
The key insight is that the same scoring infrastructure (judges, metrics, recorded traces) serves double duty. Build it once for eval, and production monitoring comes as a side effect.
We have adopted this methodology across several agents that we have released in the Databricks platform (e.g., the Data Engineering, Machine Learning, and Trace analysis capabilities in Genie), internal agents for developer productivity, as well as other customer-facing agents (e.g., AI Dev Kit, or the OSS MLflow Assistant). Overall we have seen tangible benefits:
MLflow has also been instrumental as a GenAI testing platform, helping our engineers standardize on the methodology, accelerate the development of tests, and share best practices across teams.
The testing analogy is useful here too. Our limitations map onto familiar testing problems:
Scenario generation is manual (writing test cases is expensive). We've automated scoring, alignment, and optimization, but generating the scenarios themselves is still a human task. Each scenario requires crafting realistic initial state, a meaningful prompt, and correct expectations. This is the bottleneck that limits test suite size, and a narrow test suite leads directly to the next problem. Automating scenario generation (synthesizing diverse, realistic test cases from production traffic patterns or from the agent’s specification) is an active area of work for us.
The coding assistant can overfit (test suite too narrow). If the test suite doesn't cover enough cases, the coding assistant will engineer an agent implementation that aces those specific inputs but fails on novel ones. This is the agent equivalent of writing code that passes unit tests but breaks in production. We mitigate this by feeding production failures back into eval and expanding coverage over time, but until scenario generation is automated, the test suite grows slower than we'd like.
Judge alignment is expensive (calibrating tests requires human labor). Building the Golden Set requires domain experts to manually grade outputs, the exact bottleneck we're trying to eliminate. And it's not a one-time cost: as agents evolve, judges need recalibration. We're investigating ways to make this smarter by measuring judge uncertainty, identifying the specific examples where the judge is underspecified and a human label would actually resolve ambiguity. The goal is active learning for judge alignment: instead of asking experts to grade a random sample, surface only the examples where the judge is uncertain and a domain expert's input would sharpen its criteria the most.
Multi-step failures are hard to attribute (root cause analysis). When an agent fails at step 7 of a 10-step pipeline, was the root cause at step 7 or step 3? Our judges catch the symptom but the coding assistant sometimes patches the wrong step, like fixing a test failure by changing the wrong function. Better causal tracing is an active area of work.
Novel failure modes slip through (coverage gaps). coSTAR optimizes within the dimensions the judges cover. If a new class of failure emerges that no judge checks for, it's invisible, just like a bug in code that no test exercises. coSTAR improves within its test suite, but it can't expand the test suite on its own. Humans still need to notice new failure modes and add judges.
We're building coSTAR as part of MLflow. If you're tackling similar problems, we'd love to hear about it.
