Session

Detecting and Reducing Manipulative AI Outputs With Interpretable Governance Signals

Overview

Experience: In Person
Track: Governance & Security
Industry: Communications, Media & Entertainment, Public Sector
Technologies: AI/BI
Skill Level: Intermediate

Large language models are increasingly embedded in data platforms and decision workflows, yet they can generate manipulative and propagandistic content at scale, often without obvious policy violations. In this talk, I present a production-oriented approach to detecting and mitigating manipulation in generative AI systems using interpretable, data-driven signals.

Drawing on real-world pipelines I’ve built, I show how manipulative techniques can be operationalized as measurable indicators within AI evaluation workflows. I demonstrate how these signals can be used to monitor model behavior, compare outputs across model versions, and guide alignment strategies that significantly reduce harmful generation while preserving utility.

This session focuses on how to integrate responsible AI checks into data pipelines, how to reason about governance beyond binary safety flags, and how interpretable metrics enable more trustworthy AI systems at scale.
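To make the idea of interpretable governance signals concrete, here is a minimal sketch of how manipulation indicators might be operationalized as measurable, per-signal scores over a model output. The signal names and trigger lexicons below are purely illustrative assumptions for this sketch, not the speaker's actual taxonomy or pipeline.

```python
# Hypothetical sketch: scoring a model output against simple, interpretable
# manipulation indicators. Signal names and lexicons are illustrative only.
import re

# Each signal maps to a small lexicon of trigger phrases (assumed examples).
SIGNALS = {
    "urgency": ["act now", "before it's too late", "immediately"],
    "absolutism": ["everyone knows", "always", "never", "undeniable"],
    "fear_appeal": ["catastrophe", "disaster", "threat"],
}

def score_output(text: str) -> dict:
    """Return a per-signal count of indicator hits in the text."""
    lowered = text.lower()
    return {
        name: sum(len(re.findall(re.escape(p), lowered)) for p in phrases)
        for name, phrases in SIGNALS.items()
    }

sample = "Everyone knows this is a disaster. Act now, before it's too late!"
print(score_output(sample))  # {'urgency': 2, 'absolutism': 1, 'fear_appeal': 1}
```

Because each score traces back to specific, human-readable indicators rather than a single opaque safety flag, outputs can be compared across model versions and flagged for review with an explanation attached.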

Session Speakers

Julia Jose

PhD Candidate
New York University