Improving Recall in Efficient Visual Language Models

ACL ARR 2025 May Submission 492 Authors

13 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Associative recall has emerged as a critical weakness in efficient language models and, as we demonstrate, is also a core bottleneck in efficient visual language models (VLMs). In this work, we show that efficient VLMs, exemplified by VisualRWKV, suffer from significant recall deficits, particularly on text-centric tasks such as TextVQA and document understanding. Quantitatively, the baseline VisualRWKV-7B trails the Transformer-based LLaVA-1.5-7B by 7.2 accuracy points on the TextVQA benchmark. We attribute this gap to a fundamental architectural limitation: insufficient quality of the input visual features. To address this, we propose two processing strategies that enhance visual feature representations. First, our model combines SigLIP, DINOv2, and SAM to enrich features across resolutions, retaining multi-scale visual information without increasing the number of input visual tokens. Second, we introduce a segmentation-recombination strategy that supports ultra-high-resolution inputs (up to 4096×4096), enabling precise and detailed feature extraction. These improvements substantially enhance recall performance and feature quality, allowing VisualRWKV-Boost-1.6B to outperform the larger baseline VisualRWKV-7B. Moreover, the TextVQA gap to LLaVA-1.5-7B is reduced from 7.2 to just 1.9 accuracy points, paving the way for more scalable and efficient VLM architectures.
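
To make the two strategies concrete, the sketch below is a minimal, hypothetical illustration, not the authors' implementation: stub modules stand in for the SigLIP, DINOv2, and SAM backbones, per-token features are concatenated along the channel dimension and projected back to the language-model width so the visual token count stays fixed, and a simple tile-and-average routine stands in for the segmentation-recombination step. All class names, dimensions, tile sizes, and the pooling-based recombination are assumptions.

```python
# Minimal sketch (not the authors' implementation). Stub encoders stand in for
# SigLIP / DINOv2 / SAM; all shapes, names, and the averaging-based
# recombination are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StubEncoder(nn.Module):
    """Placeholder for a frozen vision backbone (e.g., SigLIP, DINOv2, or SAM)."""

    def __init__(self, out_dim: int, grid: int = 24, patch: int = 14):
        super().__init__()
        self.grid, self.patch = grid, patch
        self.proj = nn.Conv2d(3, out_dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> tokens: (B, grid*grid, out_dim)
        x = F.interpolate(images, size=self.grid * self.patch,
                          mode="bilinear", align_corners=False)
        x = self.proj(x)                        # (B, out_dim, grid, grid)
        return x.flatten(2).transpose(1, 2)     # (B, grid*grid, out_dim)


class MultiEncoderFusion(nn.Module):
    """Concatenate per-token features from several encoders along the channel
    dimension and project back to the LLM width, so the token count is unchanged."""

    def __init__(self, encoders: list[nn.Module], dims: list[int], llm_dim: int = 2048):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.proj = nn.Linear(sum(dims), llm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = [enc(images) for enc in self.encoders]   # each (B, N, C_i)
        return self.proj(torch.cat(feats, dim=-1))       # (B, N, llm_dim)


def segment_recombine(image: torch.Tensor, fusion: MultiEncoderFusion,
                      tile: int = 1024) -> torch.Tensor:
    """Split a high-resolution image (e.g., 4096x4096) into tiles, encode each
    tile, and average tile features token-wise so the output token budget matches
    a single low-resolution pass (one possible recombination choice)."""
    _, _, H, W = image.shape
    tiles = [image[:, :, r:r + tile, c:c + tile]
             for r in range(0, H, tile) for c in range(0, W, tile)]
    tile_feats = torch.stack([fusion(t) for t in tiles])  # (T, B, N, D)
    return tile_feats.mean(dim=0)                         # (B, N, D)


if __name__ == "__main__":
    encoders = [StubEncoder(768), StubEncoder(1024), StubEncoder(256)]
    fusion = MultiEncoderFusion(encoders, dims=[768, 1024, 256], llm_dim=2048)
    hi_res = torch.randn(1, 3, 4096, 4096)
    print(segment_recombine(hi_res, fusion).shape)  # torch.Size([1, 576, 2048])
```

Averaging tile features is only one way to recombine; the paper's actual recombination scheme, encoder resolutions, and projection design may differ.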
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal content generation, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 492