Multiple Images Distract Large Multimodal Models via Attention Fragmentation

01 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-image Understanding, Large Multimodal Models
Abstract: Many everyday tasks involve integrating information across multiple images, such as comparing photos and reading social media posts. Recent large multimodal models (LMMs) therefore accept multiple images, yet open-source systems remain far from reliable in multi-image understanding, with accuracies often falling below 50% on recent evaluations. We analyse how these models allocate attention across images when visual tokens are processed in a single autoregressive, causally masked sequence. Our study uncovers a joint failure mode: the same background positions in each image repeatedly attract high attention while contributing little to prediction, and this effect is stronger for earlier images due to one-way attention under causal masking. We term this phenomenon attention fragmentation, as attention is split across non-informative tokens instead of binding evidence between images. These high-attention, low-utility tokens correspond to attention sinks previously observed in LLMs. To address attention fragmentation, we introduce Attention Remasking (AR), a zero-parameter, post-training edit that operates on attention scores at the point where the causal mask is applied. AR masks sink tokens column-wise to prevent any query from attending to them, and selectively unmasks cross-image visual tokens deemed relevant by a grounded patch relevance score. The attention mass freed from the masked sinks is reassigned to these unmasked links, creating forward connections from earlier to later images while preserving text autoregression. AR reduces attention fragmentation and improves accuracy over post-training baselines on recent multi-image benchmarks, delivering more effective cross-image integration without additional training.
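For illustration only, the following is a minimal PyTorch sketch of the score-level edit the abstract describes, applied to a single head's (T, T) attention logits. The function name `attention_remask`, the boolean mask arguments, and the self-attention safeguard are our own assumptions; in particular, the paper's grounded patch relevance score is not reproduced here and is assumed to arrive pre-computed as the `relevant_links` mask.

```python
import torch

def attention_remask(scores: torch.Tensor,
                     causal_mask: torch.Tensor,
                     sink_cols: torch.Tensor,
                     relevant_links: torch.Tensor) -> torch.Tensor:
    """Edit one head's pre-softmax attention scores, per the abstract.

    scores         : (T, T) raw attention logits.
    causal_mask    : (T, T) bool, True where attention is normally
                     allowed (key position <= query position).
    sink_cols      : (T,) bool, True at key positions flagged as
                     attention sinks.
    relevant_links : (T, T) bool, True at cross-image visual
                     (query, key) pairs judged relevant; rows for text
                     tokens stay all False, so text autoregression is
                     untouched. (Hypothetical input standing in for the
                     paper's grounded patch relevance score.)
    """
    allowed = causal_mask.clone()
    allowed[:, sink_cols] = False   # column-wise masking of sink keys
    allowed |= relevant_links       # reopen selected cross-image links,
                                    # including forward (above-diagonal)
                                    # ones between visual tokens
    # Safeguard so no query row is left fully masked; an implementation
    # detail assumed here, not specified in the abstract.
    allowed.fill_diagonal_(True)
    neg_inf = torch.finfo(scores.dtype).min
    edited = scores.masked_fill(~allowed, neg_inf)
    # Softmax renormalisation reassigns the mass freed from the masked
    # sinks to the surviving links, including the newly opened ones.
    return torch.softmax(edited, dim=-1)

# Toy usage: token 0 is a sink; one earlier-image query (position 2)
# is allowed to see one later-image key (position 5).
T = 8
scores = torch.randn(T, T)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
sinks = torch.zeros(T, dtype=torch.bool); sinks[0] = True
rel = torch.zeros(T, T, dtype=torch.bool); rel[2, 5] = True
probs = attention_remask(scores, causal, sinks, rel)
```

Because the edit happens before the softmax, no learned parameters are involved: zeroing the sink columns and reopening relevant cross-image links is what redistributes attention mass, consistent with the zero-parameter, post-training framing above. A real model would apply the same operation per head over batched (B, H, T, T) scores.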
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 651