Detecting and Reducing Manipulative AI Outputs With Interpretable Governance Signals
Overview
| Experience | In Person |
|---|---|
| Track | Governance & Security |
| Industry | Communications, Media & Entertainment, Public Sector |
| Technologies | AI/BI |
| Skill Level | Intermediate |
Large language models are increasingly embedded in data platforms and decision workflows, yet they can generate manipulative and propagandistic content at scale, often without obvious policy violations. In this talk, I present a production-oriented approach to detecting and mitigating manipulation in generative AI systems using interpretable, data-driven signals.
Drawing on real-world pipelines I’ve built, I show how manipulative techniques can be operationalized as measurable indicators within AI evaluation workflows. I demonstrate how these signals can be used to monitor model behavior, compare outputs across model versions, and guide alignment strategies that significantly reduce harmful generation while preserving utility.
This session focuses on how to integrate responsible AI checks into data pipelines, how to reason about governance beyond binary safety flags, and how interpretable metrics enable more trustworthy AI systems at scale.
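To make the idea of "interpretable, data-driven signals" concrete, here is a minimal illustrative sketch of the kind of check the session describes: scoring text against named manipulation indicators and returning per-signal rates rather than a single binary safety flag. The signal names and lexicons below are hypothetical placeholders, not the speaker's actual taxonomy; a production pipeline would draw on validated catalogs of manipulative techniques and far richer detectors than regex matching.

```python
import re

# Hypothetical indicator lexicons, for illustration only.
# A real pipeline would use a validated taxonomy of manipulative techniques.
SIGNALS = {
    "loaded_language": re.compile(r"\b(outrage|betray\w*|disgrace\w*|shocking)\b", re.I),
    "urgency_pressure": re.compile(r"\b(act now|last chance|before it'?s too late)\b", re.I),
    "false_consensus": re.compile(r"\b(everyone knows|no one denies|it is obvious)\b", re.I),
}

def manipulation_signals(text: str) -> dict[str, float]:
    """Return per-signal match rates (matches per 100 words), not a binary flag."""
    words = max(len(text.split()), 1)
    return {
        name: round(100 * len(pattern.findall(text)) / words, 2)
        for name, pattern in SIGNALS.items()
    }

sample = "Act now -- everyone knows this shocking betrayal cannot stand."
print(manipulation_signals(sample))
# → {'loaded_language': 20.0, 'urgency_pressure': 10.0, 'false_consensus': 10.0}
```

Because each signal is a named, per-output metric, the same function can be run across model versions to compare behavior over time, which is the kind of monitoring and alignment feedback loop the abstract refers to.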
Session Speakers
Julia Jose
PhD Candidate
New York University