Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

ACL ARR 2025 May Submission3646 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large vision-language models (LVLMs) have achieved impressive results in vision-language tasks. However, they often suffer from language bias, over-relying on text inputs while neglecting visual information. Therefore, we propose LACING, designed to address such bias with a Mu$\underline{\textbf{L}}$timodal Du$\underline{\textbf{A}}$l-attention Me$\underline{\textbf{C}}$han$\underline{\textbf{I}}$sm (MDA) a$\underline{\textbf{N}}$d Soft-Image $\underline{\textbf{G}}$uidance (SIG). Specifically, MDA adopts a $\textbf{parallel dual-attention mechanism}$ that constructs separate attention for visual and text inputs to enhance the integration of visual inputs across the model. SIG uses a $\textbf{learnable soft visual prompt}$ during training and inference to replace visual inputs, designed to reduce LVLMs' over-reliance on text inputs during inference. Experiments across different model architectures and scales demonstrate that LACING effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional resources.
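To make the two components concrete, below is a minimal, hypothetical PyTorch sketch of the ideas described in the abstract: a parallel dual-attention layer with separate attention branches for visual and text tokens, and a learnable soft visual prompt that can stand in for real visual inputs. All class names, dimensions, and wiring choices here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualAttentionBlock(nn.Module):
    """Sketch of a parallel dual-attention layer: each hidden state attends to
    visual tokens and text tokens through two separate attention branches,
    whose outputs are merged before continuing through the model."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden, vis_tokens, txt_tokens):
        # Separate attention over the visual and textual parts of the input.
        out_vis, _ = self.attn_vis(hidden, vis_tokens, vis_tokens)
        out_txt, _ = self.attn_txt(hidden, txt_tokens, txt_tokens)
        # Merging the two streams keeps visual signal from being drowned out by text.
        return self.norm(hidden + out_vis + out_txt)


class SoftImageGuidance(nn.Module):
    """Sketch of a learnable soft visual prompt that replaces visual inputs,
    e.g. to obtain a text-only branch for guidance at inference time."""

    def __init__(self, n_tokens: int = 32, d_model: int = 768):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, batch_size: int):
        # Expand the shared soft prompt to the batch in place of real visual features.
        return self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)


# Example usage with hypothetical shapes: 16 text tokens, 32 visual tokens.
block = DualAttentionBlock()
h = torch.randn(2, 16, 768)
v = torch.randn(2, 32, 768)
t = torch.randn(2, 16, 768)
out = block(h, v, t)           # (2, 16, 768)

sig = SoftImageGuidance()
soft_v = sig(batch_size=2)     # (2, 32, 768), substitutes for real visual inputs
```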
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, cross-modal pretraining, vision question answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3646