Doubling Medical Safety: Fine-Tuning Open LLMs for Women's Health Without Human Labels
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Healthcare & Life Sciences |
| Technologies | AI/BI, Databricks Apps, Agent Bricks |
| Skill Level | Intermediate |
Enterprises building LLM features in healthcare hit the same wall: satisfying dozens of safety rules simultaneously (crisis escalation, treatment boundaries, referral language) while real user data is off-limits and expert labeling is prohibitively expensive.

We'll show how Flo Health broke through using RFT-inspired synthetic fine-tuning, transforming Llama 3.3 70B into a healthcare-compliant assistant for women's health that doubled safety compliance versus our previous iteration. The key insight: instead of investing expert time in labeling, we redirected it into designing LLM judges that scale.

Our system uses 60 LLM judges (52 for medical safety, 8 for usefulness) with priority-weighted reward aggregation, in which P1 safety rules dominate P2 quality rules. You'll learn patterns for multi-judge evaluation systems, reward aggregation strategies for binary constraints, and why simpler approaches beat complex alternatives. For anyone building AI where "mostly safe" isn't good enough.
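As a taste of the aggregation pattern discussed in the session, here is a minimal sketch of priority-weighted reward aggregation over judge verdicts. All names (`Verdict`, `aggregate_reward`) and the exact gating rule are illustrative assumptions, not Flo Health's actual implementation: any failed P1 safety judge zeroes the reward, and only then do P2 quality judges contribute.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    judge_id: str
    priority: int   # 1 = safety (hard constraint), 2 = quality
    passed: bool    # binary verdict from an LLM judge

def aggregate_reward(verdicts: list[Verdict]) -> float:
    """Illustrative priority-weighted aggregation: a single P1 safety
    failure forces the reward to 0; otherwise the reward is the
    fraction of P2 quality judges that passed."""
    p1 = [v for v in verdicts if v.priority == 1]
    p2 = [v for v in verdicts if v.priority == 2]
    if any(not v.passed for v in p1):
        return 0.0  # safety dominates: "mostly safe" scores zero
    if not p2:
        return 1.0  # all safety constraints satisfied, no quality judges
    return sum(v.passed for v in p2) / len(p2)
```

The hard gate (rather than a weighted sum) reflects that safety rules are binary constraints: trading a safety violation for higher quality should never improve the training signal.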
Session Speakers
Vladislav Nedosekin
Director of Engineering - AI Platform
Flo Health
Michael Shtelma
Lead Product Specialist - GenAI
Databricks
Andras Meczner
Director of Medical Accuracy & Safety
Flo Health