AC-Bench: Do Multimodal Models Truly See Emotion Amidst Textual Interference?

ACL ARR 2026 January Submission 9154 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Visual Emotion Analysis, Vision-Language Models, Inference-time Steering, Affective Hijacking
Abstract: Recent advancements in Vision-Language Models (VLMs) have catalyzed a paradigm shift in vision-centric tasks, yet their ability to resolve cross-modal inconsistency in Visual Emotion Analysis (VEA) remains underexplored. To address this gap, we introduce \textbf{AC-Bench}, a novel benchmark comprising 12,604 instances across six fine-grained subtasks, specifically designed to evaluate a model's resistance to deceptive textual emotion guidance. Through a comprehensive evaluation of 9 VLMs, we identify a pervasive \textbf{"Affective Hijacking"} phenomenon and present four key findings across behavioral and mechanistic dimensions, revealing that models often exhibit a blind trust in textual descriptors at the expense of salient visual evidence. To mitigate this bias, we propose \textbf{CECS}, a training-free inference-time attention reallocation method that restores visual groundedness and significantly reduces affective hijacking under cross-modal conflict.
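The abstract describes CECS only at a high level as a training-free, inference-time attention reallocation method; it does not specify how the reallocation is computed. Below is a minimal, hypothetical sketch of one generic way such steering could be applied: boosting post-softmax attention mass on image-token positions and renormalizing. The function and parameter names (reallocate_attention, image_token_mask, alpha) are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def reallocate_attention(attn_weights: torch.Tensor,
                         image_token_mask: torch.Tensor,
                         alpha: float = 1.5) -> torch.Tensor:
    """Hypothetical inference-time steering: upweight attention on image tokens.

    attn_weights:     (batch, heads, q_len, k_len) post-softmax attention, rows sum to 1
    image_token_mask: (k_len,) bool, True where the key position is an image token
    alpha:            >1 boosts visual keys relative to text keys (assumed hyperparameter)
    """
    # Per-key scaling factor: alpha for image tokens, 1.0 for text tokens.
    scale = image_token_mask.float() * (alpha - 1.0) + 1.0
    boosted = attn_weights * scale
    # Renormalize so each query's attention distribution still sums to 1.
    return boosted / boosted.sum(dim=-1, keepdim=True)


if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(1, 2, 4, 6), dim=-1)            # toy attention map
    mask = torch.tensor([True, True, True, False, False, False])      # first 3 keys = image tokens
    steered = reallocate_attention(attn, mask, alpha=2.0)
    print(steered.sum(-1))                                            # rows still sum to 1
    print(attn[..., :3].sum(-1), "->", steered[..., :3].sum(-1))      # mass shifted to image tokens
```

In practice such a scaling would be applied inside selected attention layers (e.g., via forward hooks) during decoding; the sketch only illustrates the reweighting step itself under the assumptions stated above.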
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, vision question answering, multimodality
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 9154