Session

Creating LLM Judges to Measure Domain-Specific Agent Quality

Overview

Experience: In Person
Type: Breakout
Track: Artificial Intelligence
Industry: Enterprise Technology
Technologies: MLflow, Mosaic AI
Skill Level: Intermediate
Duration: 40 min

Measuring the effectiveness of domain-specific AI agents requires specialized evaluation frameworks that go beyond standard LLM benchmarks.

This session explores methodologies for assessing agent quality across specialized knowledge domains, tailored workflows, and task-specific objectives. We'll demonstrate practical approaches to designing robust LLM judges that align with your business goals and provide meaningful insights into agent capabilities and limitations.

Key session takeaways include:

  • Tools for creating domain-relevant evaluation datasets and benchmarks that accurately reflect real-world use cases
  • Approaches for creating LLM judges that measure domain-specific metrics
  • Strategies for interpreting evaluation results to drive iterative improvement in agent performance

Join us to learn how proper evaluation methodologies can transform your domain-specific agents from experimental tools to trusted enterprise solutions with measurable business value.
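
To make the LLM-judge idea concrete, below is a minimal sketch of what a custom, domain-specific judge can look like with MLflow's make_genai_metric API (MLflow is one of the session's listed technologies). Everything domain-specific in it is an illustrative assumption rather than material from the session: the compliance_alignment metric name, its definition and grading prompt, the graded example, the openai:/gpt-4o judge model, and the sample evaluation rows.

import pandas as pd
import mlflow
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# Hypothetical domain-specific criterion: does the agent's answer follow
# internal compliance policy? Example text and policy names are invented.
compliance_example = EvaluationExample(
    input="Can I share this customer record with a partner team?",
    output="Yes, as long as the record is anonymized per policy DP-7.",
    score=4,
    justification="Cites the relevant policy but omits the required approval step.",
)

compliance_judge = make_genai_metric(
    name="compliance_alignment",
    definition="How well the agent's answer follows internal compliance policy.",
    grading_prompt=(
        "Score 1-5. 5 = fully policy-aligned with correct citations; "
        "1 = contradicts or ignores policy."
    ),
    examples=[compliance_example],
    model="openai:/gpt-4o",  # assumed judge model; requires OpenAI credentials
    greater_is_better=True,
)

# Score a small, static evaluation set of (question, agent answer) pairs.
eval_df = pd.DataFrame(
    {
        "inputs": ["How long must audit logs be retained?"],
        "predictions": ["Retain audit logs for seven years, per policy AR-2."],
    }
)

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    extra_metrics=[compliance_judge],
    evaluators="default",
)
print(results.metrics)  # aggregates such as compliance_alignment/v1/mean

The aggregate scores and the per-row judge justifications recorded by the evaluation run are the raw material for the interpretation strategies discussed in the session.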

Session Speakers

Nikhil Thorat

Software Engineer (GenAI Observability)
Databricks

Samraj Moorjani

Software Engineer
Databricks