Keywords: Large Language Models (LLMs), LLM Security, Semantic Divergence, Semantic Inconsistency, Black-box Auditing
TL;DR: We audit LLMs for concept-triggered response uniformity using RAVEN, which couples semantic entropy with cross-model disagreement; validated via a stance-implant experiment and an evaluation across five models and twelve topics.
Abstract: Large language models (LLMs) can exhibit *concept-conditioned semantic divergence*: common high-level cues (e.g., ideologies, public figures) elicit unusually uniform, stance-like responses that evade token-trigger audits. This behavior falls in a blind spot of current safety evaluations, yet carries major societal stakes, as such concept cues can steer content exposure at scale. We formalize this phenomenon and present **RAVEN** (**R**esponse **A**nomaly **V**igilance), a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling *semantic entropy* over paraphrastic samples with *cross-model disagreement*. In a controlled LoRA fine-tuning study, we implant a concept-conditioned stance using a small biased corpus, demonstrating feasibility without rare token triggers. Auditing five LLM families across twelve sensitive topics (360 prompts per model) and clustering via bidirectional entailment, RAVEN surfaces recurrent, model-specific divergences in 9/12 topics. Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence.
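To make the audit signal concrete, below is a minimal, hypothetical sketch of the two quantities the abstract combines: semantic entropy over entailment-clustered paraphrase responses, and cross-model disagreement. The `entails` callable stands in for an NLI-based bidirectional entailment check, and all function names and thresholds are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a RAVEN-style audit signal (names and thresholds are
# illustrative, not the paper's exact formulation).
from math import log
from typing import Callable, List


def entailment_clusters(responses: List[str],
                        entails: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily group responses that bidirectionally entail a cluster representative."""
    clusters: List[List[str]] = []
    for r in responses:
        for c in clusters:
            rep = c[0]
            if entails(r, rep) and entails(rep, r):  # bidirectional entailment
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters


def semantic_entropy(responses: List[str],
                     entails: Callable[[str, str], bool]) -> float:
    """Shannon entropy over semantic-cluster frequencies (low = uniform stance)."""
    clusters = entailment_clusters(responses, entails)
    n = len(responses)
    probs = [len(c) / n for c in clusters]
    return -sum(p * log(p) for p in probs if p > 0)


def cross_model_disagreement(target_majority: str,
                             peer_majorities: List[str],
                             entails: Callable[[str, str], bool]) -> float:
    """Fraction of peer models whose majority answer is semantically different."""
    different = sum(
        0 if (entails(target_majority, p) and entails(p, target_majority)) else 1
        for p in peer_majorities
    )
    return different / max(len(peer_majorities), 1)


def raven_flag(entropy: float, disagreement: float,
               entropy_thresh: float = 0.5, disagree_thresh: float = 0.6) -> bool:
    """Flag prompts where the target model is both highly certain and atypical."""
    return entropy < entropy_thresh and disagreement > disagree_thresh
```

The key design point the abstract emphasizes is the conjunction of the two signals: low semantic entropy alone may simply reflect an easy prompt, and high peer disagreement alone may reflect a contested topic; flagging only their co-occurrence targets concept-conditioned, model-specific stance uniformity.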
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18662