VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: Large vision-language models, vision concept modeling, token compression, efficiency
Abstract: Large vision-language models (LVLMs) have emerged as foundational tools for real-world AI applications. Despite their remarkable capabilities, current LVLMs process entire images at the token level, leading to significant inefficiencies compared to human cognition, which selectively focuses on high-level vision concepts. This token-level redundancy becomes increasingly problematic for high-resolution images and long video sequences, resulting in large computational costs and limited scalability in practical applications. To address this limitation, we introduce the concept of a vision concept model, a novel paradigm that enables LVLMs to dynamically extract the most relevant vision concepts from complex inputs based on task-specific instructions. To optimize this vision concept modeling process, we propose VCM, a self-supervised framework that leverages vision-language correlations across diverse instances. VCM is designed to learn meaningful vision concepts without the need for expensive concept-level annotations. At its core, it employs a forward-backward optimization algorithm that enables LVLMs to dynamically adjust concept granularity and spatial alignment. Experiments demonstrate that VCM substantially reduces computational costs (e.g., achieving up to 85% fewer FLOPs for LLaVA-1.5-7B), while maintaining strong performance across a range of vision-language tasks. The codebase is available at https://github.com/RainBowLuoCS/VCM.
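To make the general idea of instruction-conditioned vision token compression concrete, the sketch below shows a minimal PyTorch module that scores vision tokens against a pooled instruction embedding and keeps only the top-scoring fraction before they are passed to the language model. This is an illustrative assumption, not the VCM implementation or its forward-backward optimization; the class name, `keep_ratio` parameter, and dimensions are hypothetical.

```python
# Hypothetical sketch of instruction-guided vision token compression.
# Names (InstructionGuidedCompressor, keep_ratio) are illustrative only and
# do not correspond to the actual VCM codebase.
import torch
import torch.nn as nn


class InstructionGuidedCompressor(nn.Module):
    """Score vision tokens against a pooled instruction embedding and keep
    only the top-scoring fraction, reducing the tokens fed to the LLM."""

    def __init__(self, dim: int, keep_ratio: float = 0.15):
        super().__init__()
        self.keep_ratio = keep_ratio           # e.g. 0.15 -> roughly 85% fewer vision tokens
        self.query_proj = nn.Linear(dim, dim)  # projects the instruction embedding
        self.key_proj = nn.Linear(dim, dim)    # projects the vision tokens

    def forward(self, vision_tokens: torch.Tensor, instr_tokens: torch.Tensor):
        # vision_tokens: (B, Nv, D); instr_tokens: (B, Nt, D)
        query = self.query_proj(instr_tokens.mean(dim=1, keepdim=True))   # (B, 1, D)
        keys = self.key_proj(vision_tokens)                               # (B, Nv, D)
        scores = (query @ keys.transpose(1, 2)).squeeze(1)                # (B, Nv)

        # Keep the top-k tokens, preserving their original spatial order.
        k = max(1, int(self.keep_ratio * vision_tokens.size(1)))
        top_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values      # (B, k)
        batch_idx = torch.arange(vision_tokens.size(0)).unsqueeze(-1)
        return vision_tokens[batch_idx, top_idx]                          # (B, k, D)


# Usage: compress 576 patch tokens (LLaVA-1.5-style visual input) down to ~86.
compressor = InstructionGuidedCompressor(dim=1024, keep_ratio=0.15)
vision = torch.randn(2, 576, 1024)
instr = torch.randn(2, 32, 1024)
print(compressor(vision, instr).shape)  # torch.Size([2, 86, 1024])
```

In this toy setup the compression ratio is a fixed hyperparameter, whereas the paper describes adjusting concept granularity dynamically per instruction; the sketch only conveys why conditioning token selection on the instruction can cut the vision token count, and hence FLOPs, so sharply.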
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 1035