SMART: Sink-based Modality-Aware Redistribution of Transformer Attention

ACL ARR 2026 January Submission3680 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multimodal Large Language Models; Modality Bias; Attention Sinks
Abstract: Multimodal Large Language Models (MLLMs) integrate visual and textual information, yet often exhibit modality bias, where predictions over-rely on one modality while underutilizing the other. Through analysis, we find that modality bias in MLLMs arises from imbalanced Transformer attention distribution: the dominant modality tends to receive disproportionately high attention, while low-information sink tokens absorb redundant attention that could otherwise be allocated to the under-attended modality. Motivated by this, we propose SMART (Sink-based Modality-Aware Redistribution of Transformer Attention), an inference-time method that detects modality-specific attention sinks and redistributes their excess attention to the under-attended modality. To better quantitatively assess modality bias, we construct Banana-Counting, a diagnostic dataset of 1,026 instances with mirrored information across visual and textual modalities. Our evaluation across ten MLLMs reveals severe modality bias, with some models exhibiting over 20-point accuracy gaps between visual and textual data. SMART effectively reduces the modality bias gap from 27.73 to 0.66 and improves balanced accuracy by up to 29.75%. Moreover, these gains consistently generalize to downstream tasks including VQA-v2, GQA, and ScienceQA, indicating that mitigating modality bias improves both robustness and generalization.
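The redistribution idea in the abstract can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's exact formulation: the function name `smart_redistribute`, the boolean masks, and the `keep` fraction retained on sink tokens are all assumptions introduced here for clarity. It operates on a single query's attention distribution, strips most of the mass from detected sink tokens, and spreads that freed mass over the under-attended modality in proportion to its existing attention.

```python
import numpy as np

def smart_redistribute(attn, sink_mask, target_mask, keep=0.1):
    """Illustrative sketch (not the paper's exact method).

    attn        -- one query's attention over keys (non-negative, sums to 1)
    sink_mask   -- boolean mask marking detected attention-sink tokens
    target_mask -- boolean mask marking the under-attended modality's tokens
    keep        -- fraction of attention retained on sink tokens (assumed knob)
    """
    attn = attn.astype(float).copy()
    # Free up most of the attention mass currently absorbed by sink tokens.
    budget = (attn[sink_mask] * (1.0 - keep)).sum()
    attn[sink_mask] *= keep
    # Spread the freed budget over the under-attended modality,
    # proportional to each target token's existing attention.
    tgt = attn[target_mask]
    attn[target_mask] = tgt + budget * (tgt / tgt.sum())
    # Renormalize so the result is still a probability distribution.
    return attn / attn.sum()
```

In practice such a step would be applied per head and per layer during inference, with the sinks detected per modality; here both masks are simply given as inputs to keep the sketch self-contained.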
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 3680