Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Keywords: Multimodal Learning; Modality-Mutual Attention; Object Hallucination
Abstract: Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models.
However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs.
Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains.
In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs.
Most MLLMs are built on decoder-only LLMs with a causal attention mechanism, which **limits the ability of earlier modalities (e.g., images) to incorporate information from later modalities (e.g., text)**.
To address this problem, we introduce an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA), enabling image tokens to attend to text tokens.
This simple yet effective design allows MMA to achieve state-of-the-art performance on 12 multimodal understanding benchmarks (**+5.5\% on average across 2 LLM backbones**) without introducing additional parameters.
Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
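To make the core idea concrete, here is a minimal sketch of how a causal attention mask could be "unlocked" into a modality-mutual mask for a sequence of image tokens followed by text tokens. This is an illustrative reconstruction based only on the abstract's description (image tokens gaining attention to text tokens), not the authors' implementation; the function name and layout are hypothetical.

```python
import numpy as np

def modality_mutual_mask(n_img: int, n_txt: int) -> np.ndarray:
    """Build a boolean attention mask for a sequence of n_img image
    tokens followed by n_txt text tokens (True = attention allowed).

    Starts from the standard causal (lower-triangular) mask of a
    decoder-only LLM, then unlocks the image-token rows so image
    tokens can also attend to the text tokens that follow them.
    Text tokens remain strictly causal, so no extra parameters are
    introduced -- only the mask changes.
    """
    n = n_img + n_txt
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[:n_img, n_img:] = True  # unlock: image rows see text columns
    return mask

# Example: 3 image tokens, 2 text tokens.
mask = modality_mutual_mask(n_img=3, n_txt=2)
# Image rows (0-2) can now attend to text positions 3-4,
# while text rows (3-4) keep the usual causal pattern.
```

In practice such a mask would be passed to the attention layers in place of the default causal mask (e.g., as an additive mask of 0 / -inf values before the softmax), leaving the rest of the decoder unchanged.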
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8141