Session

Improve AI Training With the First Synthetic Personas Dataset Aligned to Real-World Distributions

Overview

Experience: In Person
Type: Breakout
Track: Artificial Intelligence
Industry: Enterprise Technology, Health and Life Sciences, Financial Services
Technologies: Apache Spark, AI/BI, Llama
Skill Level: Intermediate

A major challenge in LLM development and synthetic data generation is ensuring data quality and diversity. Data that incorporates varied perspectives and reasoning traces consistently improves model performance, yet procuring such data remains out of reach for most enterprises. Human-annotated data struggles to scale, while purely LLM-based generation often suffers from distribution clipping and low entropy. In a novel compound AI approach, we combine LLMs with probabilistic graphical models and other tools to generate synthetic personas grounded in real demographic statistics. This approach addresses major limitations of existing methods in bias, licensing and persona skew. We release the first open source dataset aligned with real-world distributions and show how enterprises can use it, together with its Gretel Navigator extensions, to bring diversity and quality to model training on the Databricks Platform, all while addressing model collapse and data provenance concerns head-on.
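
As a rough illustration of the grounding idea described above, the sketch below samples persona attributes from a tiny two-node graphical model (age band, then occupation status conditioned on age band) and only then builds a prompt that an LLM could expand into a full persona. This is not the speakers' pipeline: the attribute names, the probability tables and the helper functions are all hypothetical placeholders, and the numbers are invented for the example rather than taken from any demographic source.

```python
import random
from collections import Counter

# Hypothetical, made-up probability tables (NOT real demographic statistics).
# In a real system these would be estimated from published statistics.
AGE_BAND = {"18-34": 0.3, "35-64": 0.5, "65+": 0.2}
OCCUPATION_GIVEN_AGE = {
    "18-34": {"student": 0.30, "employed": 0.65, "retired": 0.05},
    "35-64": {"student": 0.05, "employed": 0.85, "retired": 0.10},
    "65+": {"student": 0.00, "employed": 0.20, "retired": 0.80},
}


def sample_categorical(dist, rng):
    """Draw one label from a {label: probability} table."""
    labels = list(dist)
    weights = [dist[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def sample_persona_attributes(rng):
    """Ancestral sampling: draw the parent (age) first, then the child."""
    age_band = sample_categorical(AGE_BAND, rng)
    occupation = sample_categorical(OCCUPATION_GIVEN_AGE[age_band], rng)
    return {"age_band": age_band, "occupation": occupation}


def build_prompt(attrs):
    """Turn sampled attributes into a prompt for an LLM (the call itself is omitted)."""
    return (
        "Write a short persona description for a person aged "
        f"{attrs['age_band']} whose occupation status is {attrs['occupation']}."
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
personas = [sample_persona_attributes(rng) for _ in range(10_000)]

# Empirical share of each age band, to compare against the AGE_BAND table.
counts = Counter(p["age_band"] for p in personas)
age_share = {band: n / len(personas) for band, n in counts.items()}

example_prompt = build_prompt(personas[0])
```

Because the structured attributes are drawn from the tables before any text is generated, the attribute mix of the resulting personas follows the specified distribution instead of whatever the language model happens to favor.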

Session Speakers

Yev Meyer

Chief Scientist
Gretel

Dane Corneil

Staff Applied Scientist
Gretel.ai