Keywords: image captioning, lightweight model, refinement for VLM, vision language model, multimodal model
TL;DR: We explore lightweight image captioning for on-device deployment and propose Sharp-Eyed Refinement, a novel framework that improves caption quality.
Abstract: Image captioning is fundamental for applications like video-grounded chatbot systems and navigation robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal LLMs (MLLMs).
To address this, we first build lightweight captioning models using a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluate their performance not only on single-sentence captioning but also on detailed captioning tasks.
Surprisingly, we find that our model can achieve performance comparable to MLLMs, suggesting its potential to serve as a strong captioning specialist for on-device applications.
While promising, our model shares a limitation with other MLLMs: it suffers from occasional captioning errors.
We investigate the underlying causes and find that these errors stem from ineffective attention mechanisms and limited visual representations.
To alleviate these issues, we develop a novel captioning framework, Sharp-Eyed Refinement, which improves caption quality by refining coarse descriptions into more precise captions. At its core, DeepLens strengthens visual grounding by re-examining the informative regions identified in the initial glance.
Experimental results demonstrate that our model outperforms both recent lightweight captioning methods and MLLMs on detailed captioning, and even on long-range video QA tasks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3588