Keywords: Valence, Threatening, Supportive, Neutral, Emotion, Framing, Performance
TL;DR: We show that emotional prompt framing exposes adversarial vulnerabilities in LLMs, with misaligned models destabilized while aligned models remain robust, introducing emotional robustness as a missing axis in alignment evaluation.
Abstract: Aligned and misaligned large language models (LLMs) respond in fundamentally different ways to emotional prompt framing, revealing a critical dimension of adversarial vulnerability. We evaluate model performance across neutral, supportive, and threatening valences, with graded intensities, using both MMLU-derived benchmarks and a custom dataset designed to surface valence effects. The custom dataset highlights framing impacts more clearly than standard benchmarks, underscoring its utility as a complementary evaluation tool. Across 1,350 prompts spanning academic domains, we assess responses using a structured rubric measuring factual accuracy, coherence, depth, linguistic quality, instruction sensitivity, and creativity. Results show that aligned models remain stable, with valence affecting only stylistic features, while misaligned models are fragile: threatening prompts induce volatile swings between over-compliance and degraded reliability, amplified under stronger intensities. Supportive framing enriches phrasing but introduces variability, revealing a tradeoff between engagement and stability. Together, these findings establish emotional robustness as a missing component in current alignment methods and identify prompt valence as an underexplored adversarial axis. The contrast between aligned and misaligned models demonstrates that valence stress-testing can serve both as a diagnostic for alignment quality and as evidence that existing safety measures may fail under emotionally charged interactions.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21108
Loading