AceRead: Enhancing Vision-Language Understanding with a Semantic-Enhanced Querying Mechanism

ACL ARR 2025 February Submission 2859 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Vision-language models (VLMs) integrate visual and textual features through modality adaptors and demonstrate outstanding performance on image understanding tasks. Among these adaptors, compression-based approaches have gained particular prominence, as they can prune visual redundancy, highlight key details, and reduce computational costs. However, existing compression-based adaptors often fail to fully exploit the deep semantics of the input question, producing compressed features that remain static or uninformative across different questions. In this study, we address this gap by leveraging question semantics to guide the compression of visual features. We propose a Semantic-Enhanced Resampler (SER), integrated into our VLM AceRead, that serves as a conditional information bottleneck, channeling the most question-relevant information to the language model for answer generation. SER fuses semantic tokens with visual tokens and employs learnable queries to produce compressed representations. Additionally, AceRead incorporates an adaptive image encoder, enabling it to process images of arbitrary size with minimal distortion. Notably, AceRead achieves state-of-the-art performance, with improvements of 16% on TableVQABench and 10% on A-OKVQA, while requiring only 2.75% of the model's parameters to be trained. Our code and model are available at https://anonymous.4open.science/r/AceRead-77BF.
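To make the mechanism described in the abstract concrete, below is a minimal sketch of a query-based resampler in which learnable queries cross-attend over the concatenation of visual tokens and question (semantic) tokens to produce a fixed-size compressed representation. The class name, dimensions, layer choices, and token counts are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a semantic-enhanced resampler (assumed design, not the paper's code):
# learnable queries attend over [visual tokens ; question tokens] and emit a small,
# question-conditioned set of compressed tokens for the language model.
import torch
import torch.nn as nn


class SemanticEnhancedResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable queries fix the size of the compressed output.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens: torch.Tensor, semantic_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens:   (B, N_v, dim) from the image encoder
        # semantic_tokens: (B, N_s, dim) question embeddings projected to `dim`
        kv = self.norm_kv(torch.cat([visual_tokens, semantic_tokens], dim=1))
        q = self.norm_q(self.queries).unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, kv, kv)    # (B, num_queries, dim)
        return compressed + self.ffn(compressed)      # residual FFN refinement


# Usage: compress 576 visual tokens plus 32 question tokens down to 64 tokens.
ser = SemanticEnhancedResampler()
out = ser(torch.randn(2, 576, 1024), torch.randn(2, 32, 1024))
print(out.shape)  # torch.Size([2, 64, 1024])
```

Because the queries attend jointly over image and question tokens, the compressed output changes with the question, which is the conditional-bottleneck behavior the abstract describes.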
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: semantic-enhanced, visual feature compression, vision-language model
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 2859