Sponsored by: Snorkel AI | The Art & Science of Benchmarking Agents
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Enterprise Technology, Healthcare & Life Sciences, Financial Services |
| Technologies | Data Marketplace |
| Skill Level | Intermediate |
Our ability to measure AI has been outpaced by our ability to develop it, and this eval gap is one of the most important problems in AI. We need more enduring benchmarks to close this gap and, in turn, advance entirely new vectors of capability for the field. In this talk, I'll share our learnings from evaluating agents, drawing on experience working with nearly all global frontier labs and leading academics. We'll discuss the science (i.e., the mechanics that make benchmarks rigorous and effective) and the art (i.e., the intangibles that drive ambitious and enduring benchmarks) of building great benchmarks. I'll close by sharing some of the learnings from Open Benchmarks Grants, a $3M initiative in partnership with Hugging Face, Together AI, Prime Intellect, Factory, and others.
Session Speakers
Saurabh Singh
Staff Product Manager
Snorkel AI