Keywords: Large Vision-Language Models, Feature Modulation, Efficient
Abstract: Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce long-context computational burdens, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Model (LLM). Unlike dominant LVLMs that rely on visual token concatenation, LaVi sidesteps long-context expansion by injecting vision-conditioned deltas into the affine parameters of LayerNorm, a ubiquitous component in modern LLMs. This lightweight transformation lets visual input directly modulate the linguistic hidden states, grounding next-token probabilities in visual evidence. LaVi thus achieves precise vision-language alignment while retaining the LLM's linguistic priors and substantially reducing computation. Across 18 image, video, and language benchmarks, LaVi delivers superior or comparable performance with substantial efficiency gains while preserving strong linguistic capability: compared to LLaVA-OV-7B, it reduces FLOPs by 94.0%, accelerates inference by 3.1×, and halves memory consumption. These properties make LaVi a scalable and practical framework for real-time multimodal reasoning. Code and models will be released.
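The core mechanism described in the abstract (vision-conditioned deltas applied to LayerNorm's affine parameters) can be sketched as follows. This is a minimal illustrative sketch, not the paper's released implementation: the class name `VisionConditionedLayerNorm`, the projector `delta_mlp`, and the pooled single-vector visual feature are all assumptions made for clarity.

```python
import torch
import torch.nn as nn


class VisionConditionedLayerNorm(nn.Module):
    """LayerNorm whose affine parameters are shifted by vision-conditioned deltas.

    Hypothetical sketch: the base gamma/beta play the role of the LLM's original
    LayerNorm affine parameters, and a lightweight MLP maps a pooled visual
    feature to per-channel deltas (d_gamma, d_beta) that modulate the linguistic
    hidden states without appending any visual tokens to the sequence.
    """

    def __init__(self, hidden_dim: int, vision_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Base affine parameters, as in a standard LayerNorm.
        self.gamma = nn.Parameter(torch.ones(hidden_dim))
        self.beta = nn.Parameter(torch.zeros(hidden_dim))
        # Lightweight projector from visual features to affine deltas (assumed design).
        self.delta_mlp = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),
        )

    def forward(self, x: torch.Tensor, vision_feat: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim); vision_feat: (batch, vision_dim)
        d_gamma, d_beta = self.delta_mlp(vision_feat).chunk(2, dim=-1)
        gamma = self.gamma + d_gamma.unsqueeze(1)  # broadcast over the sequence
        beta = self.beta + d_beta.unsqueeze(1)
        return gamma * self.norm(x) + beta


# Usage: modulate text hidden states with a pooled visual embedding,
# leaving the sequence length (and thus attention cost) unchanged.
ln = VisionConditionedLayerNorm(hidden_dim=4096, vision_dim=1024)
text_hidden = torch.randn(2, 128, 4096)   # no visual tokens concatenated
visual_embed = torch.randn(2, 1024)
out = ln(text_hidden, visual_embed)        # shape: (2, 128, 4096)
```

Because the visual signal enters only through these per-channel affine shifts, the token sequence never grows, which is consistent with the efficiency gains the abstract reports relative to token-concatenation approaches.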
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2883