ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM

ACL ARR 2025 May Submission 4141 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) often suffer from hallucinations: they over-rely on partial cues and generate incorrect responses. Recently, methods such as Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against the original outputs. In this work, we show that methods like VCD and ICD fundamentally alter the model's internal attention dynamics, which suggests that their effectiveness stems not merely from surface-level modifications to the logits but from deeper shifts in the attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in the model's attention mechanisms, offering a more principled approach to mitigating hallucinations. Specifically, we introduce positive and negative steering as two complementary directions for adapting the model's internal attention distributions. Rather than passively adjusting logits, as is commonly done, our method dynamically modulates attention pathways within the contrastive decoding process, enabling selective enhancement or suppression of visual feature contributions in a structured manner. Furthermore, we propose a dynamic selection mechanism that identifies text-centric heads (those that predominantly attend to text rather than visual features) for targeted positive steering, and a complementary mechanism that selects the most critical visual tokens for negative steering, enabling more effective attention adjustments. Experiments across multiple MLLM architectures (e.g., LLaVA-1.5 7B, LLaVA-NeXT 7B, Phi2-SigLIP) and diverse decoding strategies (greedy search, beam search, nucleus sampling) demonstrate that our approach significantly reduces hallucinations on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously improving performance on standard VQA benchmarks, including MMMU, MM-VET, ScienceQA, TextVQA, and GQA.
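
The sketch below illustrates the two ingredients the abstract describes: reweighting attention toward or away from visual tokens on selected heads (positive vs. negative steering), and a VCD-style contrastive combination of the logits from the two steered passes. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the hyperparameters alpha, beta, and gamma, and the mask-based selection of heads and visual tokens are illustrative choices.

import torch

def steer_attention(attn_weights, visual_token_mask, head_indices, gamma=0.5):
    # attn_weights: (batch, heads, query_len, key_len), rows sum to 1.
    # visual_token_mask: (key_len,) boolean mask marking visual tokens.
    # Positive steering (gamma > 0) upweights attention to visual tokens on the
    # selected heads (e.g., text-centric ones); negative steering (gamma < 0)
    # suppresses attention to the selected critical visual tokens.
    steered = attn_weights.clone()
    boost = 1.0 + gamma * visual_token_mask.to(attn_weights.dtype)
    steered[:, head_indices] = steered[:, head_indices] * boost
    return steered / steered.sum(dim=-1, keepdim=True)  # renormalize each row

def contrastive_decode_step(logits_pos, logits_neg, alpha=1.0, beta=0.1):
    # logits_pos / logits_neg: (batch, vocab) next-token logits from the
    # positively and negatively steered forward passes.
    contrast = (1 + alpha) * logits_pos - alpha * logits_neg
    # Adaptive plausibility constraint: keep only tokens whose probability under
    # the positive branch is within a beta-fraction of that branch's maximum.
    probs_pos = torch.softmax(logits_pos, dim=-1)
    cutoff = beta * probs_pos.max(dim=-1, keepdim=True).values
    return contrast.masked_fill(probs_pos < cutoff, float("-inf"))

In a full decoding loop, steer_attention would presumably be applied inside the model's attention layers (e.g., via forward hooks) for each of the two branches, and contrastive_decode_step would then combine the branch logits to pick the next token.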
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality, cross-modal content generation
Contribution Types: Model analysis & interpretability, Reproduction study, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 4141