Keywords: Large Language Models (LLMs), LLM Security, Semantic Divergence, Semantic Inconsistency, Black-box Auditing
TL;DR: We audit LLMs for concept-triggered response uniformity using RAVEN, which couples semantic entropy with cross-model disagreement; validated via a stance-implant experiment and an evaluation across five models and twelve topics.
Abstract: Large language models (LLMs) can exhibit *concept-conditioned semantic divergence*: common high-level cues (e.g., ideologies, public figures) elicit unusually uniform, stance-like responses that evade token-trigger audits. This behavior falls in a blind spot of current safety evaluations, yet carries major societal stakes, as such concept cues can steer content exposure at scale. We formalize this phenomenon and present **RAVEN** (**R**esponse **A**nomaly **V**igilance), a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling *semantic entropy* over paraphrastic samples with *cross-model disagreement*. In a controlled LoRA fine-tuning study, we implant a concept-conditioned stance using a small biased corpus, demonstrating feasibility without rare token triggers. Auditing five LLM families across twelve sensitive topics (360 prompts per model) and clustering via bidirectional entailment, RAVEN surfaces recurrent, model-specific divergences in 9/12 topics. Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence.
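To make the audit signal concrete, below is a minimal, hypothetical sketch of the two quantities the abstract combines: semantic entropy over entailment-clustered paraphrase responses, and cross-model disagreement. The `entails` callable stands in for an NLI-based bidirectional entailment check, and all function names and thresholds are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a RAVEN-style audit signal (names and thresholds are
# illustrative, not the paper's exact formulation).
from math import log
from typing import Callable, List


def entailment_clusters(responses: List[str],
                        entails: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily group responses that bidirectionally entail a cluster representative."""
    clusters: List[List[str]] = []
    for r in responses:
        for c in clusters:
            rep = c[0]
            if entails(r, rep) and entails(rep, r):  # bidirectional entailment
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters


def semantic_entropy(responses: List[str],
                     entails: Callable[[str, str], bool]) -> float:
    """Shannon entropy over semantic-cluster frequencies (low = uniform stance)."""
    clusters = entailment_clusters(responses, entails)
    n = len(responses)
    probs = [len(c) / n for c in clusters]
    return -sum(p * log(p) for p in probs if p > 0)


def cross_model_disagreement(target_majority: str,
                             peer_majorities: List[str],
                             entails: Callable[[str, str], bool]) -> float:
    """Fraction of peer models whose majority answer is semantically different."""
    different = sum(
        0 if (entails(target_majority, p) and entails(p, target_majority)) else 1
        for p in peer_majorities
    )
    return different / max(len(peer_majorities), 1)


def raven_flag(entropy: float, disagreement: float,
               entropy_thresh: float = 0.5, disagree_thresh: float = 0.6) -> bool:
    """Flag prompts where the target model is both highly certain and atypical."""
    return entropy < entropy_thresh and disagreement > disagree_thresh
```

The key design point the abstract emphasizes is the conjunction of the two signals: low semantic entropy alone may simply reflect an easy prompt, and high peer disagreement alone may reflect a contested topic; flagging only their co-occurrence targets concept-conditioned, model-specific stance uniformity.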
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18662