Keywords: multimodal large language models, vision-centric activation, vision-centric coordination
TL;DR: In this paper, we introduce VaCo, a framework that optimizes MLLM representations through visual activation and coordination derived from multiple vision foundation models (VFMs).
Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are supervised solely by next-token prediction over textual tokens, neglecting vision-centric information that is essential for analytical abilities. To tackle this dilemma, we introduce **VaCo**, which optimizes MLLM representations through **V**ision-Centric **a**ctivation and **Co**ordination from multiple vision foundation models (VFMs). VaCo employs visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate learnable *Modular Task Queries* (MTQs) and *Visual Alignment Layers* (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted *Token Gateway Mask* (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
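Below is a minimal sketch, not the authors' implementation, of one way a Token Gateway Mask could restrict information flow among groups of MTQs: each VFM-specific group of query tokens attends to the shared multimodal sequence but is isolated from the other groups. The function name `build_token_gateway_mask` and parameters such as `num_groups` and `queries_per_group` are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of a group-isolating attention mask in the spirit of the TGM.
import torch

def build_token_gateway_mask(seq_len: int, num_groups: int, queries_per_group: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Assumed token layout: [shared image/text tokens | group_0 MTQs | ... | group_{G-1} MTQs].
    Every token may attend to the shared prefix; MTQs of one group attend among
    themselves but not to MTQs of other groups.
    """
    total = seq_len + num_groups * queries_per_group
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :seq_len] = True  # all tokens see the shared multimodal tokens
    for g in range(num_groups):
        start = seq_len + g * queries_per_group
        end = start + queries_per_group
        mask[start:end, start:end] = True  # intra-group MTQ attention only
    return mask

# Example: 8 shared tokens, 3 VFM-specific MTQ groups of 4 queries each.
tgm = build_token_gateway_mask(seq_len=8, num_groups=3, queries_per_group=4)
print(tgm.shape)  # torch.Size([20, 20])
```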
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2982