Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
Abstract: Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
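The abstract's three observations map onto a simple per-step recalibration loop. Below is a minimal, hypothetical sketch of what such a mechanism could look like during decoding; it is not the authors' released implementation. The function name `sparc_step`, the parameters `alpha` and `tau`, and the toy random attention stream are illustrative assumptions chosen only to make the three ideas concrete: select tokens by cross-step attention differences, amplify only those tokens, and accumulate the boost so visual influence does not fade.

```python
import torch

def sparc_step(vis_attn, prev_vis_attn, running_boost, alpha=0.1, tau=0.0):
    """Hypothetical sketch of one SPARC-style recalibration step.

    vis_attn:      (V,) attention mass on the V visual tokens at the current step.
    prev_vis_attn: (V,) the same quantity at the previous decoding step.
    running_boost: (V,) boost accumulated across steps (initialize to zeros).
    alpha:         strength of the progressive reinforcement (assumed value).
    tau:           threshold on the attention increase used to flag salient tokens.
    Returns the recalibrated attention and the updated running boost.
    """
    # (2) flag visual tokens whose attention rose since the last step;
    #     differences across steps are less noisy than the raw weights
    delta = vis_attn - prev_vis_attn
    salient = (delta > tau).float()

    # (3) accumulate boost only for salient tokens, so visual influence
    #     is reinforced rather than allowed to fade as the caption grows
    running_boost = running_boost + alpha * salient * delta.clamp(min=0.0)

    # (1) amplify only the selected tokens, then renormalize
    recalibrated = vis_attn + running_boost
    return recalibrated / recalibrated.sum(), running_boost


# Toy usage with random per-step attention maps (illustrative only).
V, steps = 16, 5
boost = torch.zeros(V)
prev = torch.full((V,), 1.0 / V)
for _ in range(steps):
    cur = torch.softmax(torch.randn(V), dim=0)  # stand-in for the model's visual attention
    recal, boost = sparc_step(cur, prev, boost)
    prev = cur
```

The key design choice, per the abstract, is to select tokens by the change in attention between steps rather than by the raw attention values, since the raw values become noisier as the caption lengthens.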
Lay Summary: When computers describe images in detail — a task important for things like creating new data and helping people who are visually impaired — they need to be both accurate (precision) and thorough (recall). However, today’s advanced models often struggle to balance these two goals.
We found that as these models generate longer descriptions, their focus on the image starts to blur, and they rely less on the actual visual content, leading to errors. To fix this, we created a simple technique called SPARC. It works by carefully boosting the most important visual details while the model writes the caption, helping it stay focused on the image even as the description grows longer.
Unlike other methods that improve precision but harm recall, SPARC improves both, and it does so without needing extra training or heavy computation. This can make automatic image descriptions more useful and reliable.
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Multimodal Large Language Models, Detailed Image Captioning, Attention-based Strategies
Submission Number: 1292