Creating LLM Judges to Measure Domain-Specific Agent Quality
Overview
| Experience | In Person |
| --- | --- |
| Type | Breakout |
| Track | Artificial Intelligence |
| Industry | Enterprise Technology |
| Technologies | MLflow, Mosaic AI |
| Skill Level | Intermediate |
| Duration | 40 min |
Measuring the effectiveness of domain-specific AI agents requires specialized evaluation frameworks that go beyond standard LLM benchmarks.
This session explores methodologies for assessing agent quality across specialized knowledge domains, tailored workflows, and task-specific objectives. We'll demonstrate practical approaches to designing robust LLM judges that align with your business goals and provide meaningful insights into agent capabilities and limitations.
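To give a flavor of what this looks like in practice, below is a minimal sketch of a domain-specific LLM judge built with MLflow's `make_genai_metric`. The domain (insurance claims handling), metric name, grading rubric, and judge model are illustrative assumptions, not the exact setup demonstrated in the session.

```python
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# A few-shot example that shows the judge what a well-scored answer looks like.
# The insurance-claims domain and rubric here are assumptions for this sketch.
example = EvaluationExample(
    input="Is water damage from a burst pipe covered under a standard HO-3 policy?",
    output=(
        "Yes, sudden and accidental water damage from a burst pipe is typically "
        "covered, but gradual leaks and resulting mold are usually excluded."
    ),
    score=4,
    justification=(
        "States the coverage rule and the key exclusion, but does not cite the "
        "relevant policy section."
    ),
    grading_context={
        "targets": (
            "HO-3 policies cover sudden and accidental discharge of water; "
            "gradual seepage is excluded."
        )
    },
)

# Domain-specific judge: an LLM scores agent answers against reference guidance.
claims_accuracy = make_genai_metric(
    name="claims_guidance_accuracy",  # hypothetical metric name
    definition=(
        "Measures whether the agent's answer is consistent with authoritative "
        "claims-handling guidance for the customer's policy type."
    ),
    grading_prompt=(
        "Score 1-5. 5: fully consistent with the reference guidance and states the "
        "key conditions. 3: mostly correct with minor omissions. 1: contradicts the "
        "reference guidance."
    ),
    examples=[example],
    model="openai:/gpt-4o",  # judge model; point this at your own endpoint
    grading_context_columns=["targets"],
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
```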
Key session takeaways include:
- Tools for creating domain-relevant evaluation datasets and benchmarks that accurately reflect real-world use cases
- Approaches for creating LLM judges to measure domain-specific metrics
- Strategies for interpreting those results to drive iterative improvement in agent performance (see the evaluation sketch after this list)
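On the last point, here is a minimal sketch of running such a judge over a small evaluation set with `mlflow.evaluate` and reading both aggregate scores and per-row justifications; the dataset, column names, and the `claims_accuracy` metric from the sketch above are assumptions, not the session's actual benchmark.

```python
import mlflow
import pandas as pd

# Hypothetical evaluation set: agent inputs, agent outputs, and the reference
# guidance the judge grades against. Column names are assumptions for this sketch.
eval_df = pd.DataFrame(
    {
        "inputs": ["Is hail damage to a detached garage covered?"],
        "predictions": [
            "Yes, detached garages fall under Coverage B (other structures) and are "
            "covered for hail, typically up to 10% of the dwelling limit."
        ],
        "ground_truth": [
            "Coverage B insures other structures, including detached garages, "
            "against hail, up to 10% of the Coverage A limit."
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_df,
        predictions="predictions",
        targets="ground_truth",            # mapped to the judge's grading context
        extra_metrics=[claims_accuracy],   # the judge defined in the earlier sketch
    )

# Aggregate scores (mean, variance) for tracking quality release over release.
print(results.metrics)

# Per-row scores plus the judge's justifications, which is where iterative
# improvement happens: read the failures, then fix prompts, tools, or retrieval.
print(results.tables["eval_results_table"])
```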
Join us to learn how proper evaluation methodologies can transform your domain-specific agents from experimental tools to trusted enterprise solutions with measurable business value.
Session Speakers
Nikhil Thorat
Software Engineer (GenAI Observability)
Databricks
Samraj Moorjani
Software Engineer
Databricks