Session
Vibes to Validation: AI Eval for Partners
Overview
| Experience | In Person |
|---|---|
- Stop eyeballing Genie answers and start measuring them. In this hands-on session you'll attach a labeled evaluation set to a Genie Space, run the built-in benchmark, and read per-question scores to find out where the space actually fails. Bring the Genie Space you built yesterday, or spin one up from a track.
- You'll write 15 questions across five failure-mode categories, label the acceptable answers for each, and run the benchmark to pinpoint your weakest category. Then you'll change one line of instructions, re-run, and watch the score move, sometimes the wrong way, which is a real finding. You'll leave with a completed benchmark run and a repeatable method you can use on a customer's Genie Space next week.
- Hands-on workshop - Bring a laptop - Free Edition account required - Lunch will be served.
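
To make the scoring step concrete, here is a minimal sketch of how per-question pass/fail results roll up into per-category scores, and how a before/after comparison can surface a category that got worse. It is not the Genie benchmark and calls no Databricks API. The function names, the five category names, and every pass/fail value are hypothetical placeholders chosen only for illustration.

```python
from collections import defaultdict


def score_by_category(results):
    """Turn (category, passed) pairs into a {category: pass rate} dict."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def weakest_category(scores):
    """Return the category with the lowest pass rate."""
    return min(scores, key=scores.get)


def score_delta(before, after):
    """Per-category change in pass rate between two runs."""
    return {c: after[c] - before[c] for c in before if c in after}


# Hypothetical results: 15 questions, 3 per made-up category.
BASELINE = (
    [("ambiguous_term", p) for p in (True, False, False)]
    + [("date_filter", p) for p in (True, True, False)]
    + [("join", p) for p in (True, True, True)]
    + [("aggregation", p) for p in (True, True, False)]
    + [("out_of_scope", p) for p in (True, True, True)]
)

# Hypothetical re-run after changing one line of instructions:
# one category improves while another regresses.
RERUN = (
    [("ambiguous_term", p) for p in (True, True, False)]
    + [("date_filter", p) for p in (True, False, False)]
    + [("join", p) for p in (True, True, True)]
    + [("aggregation", p) for p in (True, True, False)]
    + [("out_of_scope", p) for p in (True, True, True)]
)

baseline_scores = score_by_category(BASELINE)
rerun_scores = score_by_category(RERUN)
delta = score_delta(baseline_scores, rerun_scores)

if __name__ == "__main__":
    print("Weakest category at baseline:", weakest_category(baseline_scores))
    for category, change in sorted(delta.items(), key=lambda kv: kv[1]):
        print(f"{category}: {change:+.2f}")
```

Comparing categories rather than a single overall score is the point of the sketch: in this made-up data the overall pass count is the same in both runs, yet one category improved and another regressed.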