Doubling Medical Safety: Fine-Tuning Open LLMs for Women's Health Without Human Labels
Overview
| Experience | In Person |
|---|---|
| Track | Artificial Intelligence & Agents |
| Industry | Healthcare & Life Sciences |
| Technologies | AI/BI, Databricks Apps, Agent Bricks |
| Skill Level | Intermediate |
Enterprises building LLM features in healthcare hit the same wall: satisfying dozens of safety rules simultaneously (crisis escalation, treatment boundaries, referral language) while real user data is off-limits and expert labeling is prohibitively expensive.

We'll show how Flo Health broke through using RFT-inspired synthetic fine-tuning, transforming Llama 3.3 70B into a healthcare-compliant assistant for women's health that doubled safety compliance versus our previous iteration. The key insight: instead of investing expert time in labeling, we redirected it into designing LLM judges that scale.

Our system uses 60 LLM judges (52 for medical safety, 8 for usefulness) with priority-weighted reward aggregation, in which P1 safety rules dominate P2 quality rules. You'll learn patterns for multi-judge evaluation systems, reward aggregation strategies for binary constraints, and why simpler approaches beat complex alternatives. For anyone building AI where "mostly safe" isn't good enough.
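As a taste of the aggregation pattern discussed in the session, here is a minimal sketch of priority-weighted reward aggregation over judge verdicts. All names (`Verdict`, `aggregate_reward`) and the exact gating rule are illustrative assumptions, not Flo Health's actual implementation: any failed P1 safety judge zeroes the reward, and only then do P2 quality judges contribute.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    judge_id: str
    priority: int   # 1 = safety (hard constraint), 2 = quality
    passed: bool    # binary verdict from an LLM judge

def aggregate_reward(verdicts: list[Verdict]) -> float:
    """Illustrative priority-weighted aggregation: a single P1 safety
    failure forces the reward to 0; otherwise the reward is the
    fraction of P2 quality judges that passed."""
    p1 = [v for v in verdicts if v.priority == 1]
    p2 = [v for v in verdicts if v.priority == 2]
    if any(not v.passed for v in p1):
        return 0.0  # safety dominates: "mostly safe" scores zero
    if not p2:
        return 1.0  # all safety constraints satisfied, no quality judges
    return sum(v.passed for v in p2) / len(p2)
```

The hard gate (rather than a weighted sum) reflects that safety rules are binary constraints: trading a safety violation for higher quality should never improve the training signal.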
Session Speakers
Vladislav Nedosekin
Director of Engineering - AI Platform
Flo Health
Michael Shtelma
Lead Product Specialist - GenAI
Databricks
Andras Meczner
Director of Medical Accuracy & Safety
Flo Health