Session
Agent as a Judge: Scaling AI Evaluation for the Agentic Era
Overview
| Experience | In Person |
|---|---|
Agents have arrived. Evaluation workflows need to catch up. The bottleneck is knowing what to evaluate. You can't catch failures you don't know exist, and in production agents they're subtle: unclear tool definitions, bloated context, brittle handoffs, and inefficiencies that compound across long sessions. Manual trace review is the only way to discover them, but it doesn't scale. Blind spots accumulate.

Agent as a Judge closes this gap. Where LLM-as-a-judge scores predefined criteria, evaluation agents do open-ended discovery. They work directly on raw traces, surface recurring failure modes, and turn those findings into targeted, high-signal evaluators: a step toward agents that improve agents.

We'll show how this works in Phoenix's open-source evaluation agent, running on Phoenix's own production traces. You'll see what it surfaced, how it compares to human reviewers, and what it takes to trust agents with the work of evaluating agents.
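To make the distinction concrete, here is a minimal, hedged sketch of the two evaluation styles described above. It is not Phoenix's actual API; `Trace`, `llm_as_judge`, `agent_as_judge`, and the `llm` callable are hypothetical names, and the prompts are illustrative only.

```python
# Sketch only: contrasts a fixed-rubric LLM-as-a-judge with an agent-as-judge
# that discovers failure modes from raw traces and turns them into evaluators.
# All names here are placeholders, not Phoenix's real interfaces.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trace:
    """A raw agent trace: the full sequence of steps, tool calls, and outputs."""
    trace_id: str
    steps: list[str]

def llm_as_judge(llm: Callable[[str], str], output: str, criterion: str) -> str:
    # Fixed rubric: score one output against a predefined criterion.
    prompt = f"Score this output against the criterion '{criterion}':\n\n{output}"
    return llm(prompt)

def agent_as_judge(
    llm: Callable[[str], str], traces: list[Trace]
) -> list[Callable[[Trace], str]]:
    # Open-ended discovery: review whole traces, not a predefined checklist.
    findings = [
        llm("Review this trace and name any failure modes:\n" + "\n".join(t.steps))
        for t in traces
    ]
    # Cluster findings into recurring failure modes (a single call here for brevity).
    failure_modes = llm(
        "Group these findings into recurring failure modes, one per line:\n"
        + "\n".join(findings)
    ).splitlines()

    # Each discovered failure mode becomes a narrow, targeted evaluator
    # that can then be run at scale over new traces.
    def make_evaluator(mode: str) -> Callable[[Trace], str]:
        return lambda t: llm(
            f"Does this trace exhibit '{mode}'? Answer yes/no with evidence:\n"
            + "\n".join(t.steps)
        )

    return [make_evaluator(mode) for mode in failure_modes if mode.strip()]
```

The key design difference is where the criteria come from: the first function is handed its rubric, while the second derives its evaluators from what it finds in the traces themselves.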