Abstract: The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds significant promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between the text-centric pre-training of LLMs and the vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable fine-tuning. Consequently, prior works typically keep LLM blocks frozen while learning only the vision components. To address these challenges, we introduce the Language-Adapted Vision Enhancer (LAVIE), a novel framework that bridges this modality gap through a synergistic pre-training strategy. LAVIE co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the same MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LAVIE significantly improves performance on various downstream vision tasks, offering an effective and efficient way to enhance visual understanding using frozen LLM knowledge.
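To make the training strategy described in the abstract concrete, below is a minimal PyTorch sketch of the idea: a trainable ViT encoder feeds a frozen transformer block standing in for the LLM, whose only trainable parameters are LoRA adapters, and both are optimized with a single MAE masked-reconstruction loss. All module names, dimensions, the masking ratio, and the placement of the LoRA adapters are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen base weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class LLMBlockWithLoRA(nn.Module):
    """Stand-in for one frozen pretrained LLM transformer block; only its
    LoRA adapters (here, on the MLP projections) receive gradients."""
    def __init__(self, dim: int, nhead: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp_in = LoRALinear(nn.Linear(dim, 4 * dim))
        self.mlp_out = LoRALinear(nn.Linear(4 * dim, dim))
        for m in (self.attn, self.norm1, self.norm2):
            for p in m.parameters():
                p.requires_grad = False                  # keep the block frozen

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp_out(F.gelu(self.mlp_in(self.norm2(x))))


class LAVIESketch(nn.Module):
    """Toy ViT encoder -> frozen LLM block with LoRA -> MAE decoder,
    all driven by one masked-reconstruction objective."""
    def __init__(self, num_patches: int = 196, patch_dim: int = 768,
                 dim: int = 256, mask_ratio: float = 0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2)                                # trainable ViT backbone
        self.llm_block = LLMBlockWithLoRA(dim)           # frozen weights + LoRA
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, patch_dim)            # predict masked pixels

    def forward(self, patches):                          # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        pos = self.pos_embed[:, :N].expand(B, -1, -1)
        tokens = self.patch_embed(patches) + pos
        # MAE-style random masking: encode only the visible patches.
        keep = int(N * (1 - self.mask_ratio))
        order = torch.rand(B, N, device=patches.device).argsort(dim=1)
        vis_idx, mask_idx = order[:, :keep], order[:, keep:]
        visible = tokens.gather(1, vis_idx[..., None].expand(-1, -1, tokens.size(-1)))
        # ViT features pass through the LLM block, so both sides are adapted
        # by the same reconstruction objective.
        fused = self.llm_block(self.vit(visible))
        # Decoder sees fused visible tokens plus positional mask tokens.
        mask_tok = self.mask_token + pos.gather(
            1, mask_idx[..., None].expand(-1, -1, pos.size(-1)))
        rec = self.head(self.decoder(torch.cat([fused, mask_tok], dim=1)))[:, keep:]
        target = patches.gather(1, mask_idx[..., None].expand(-1, -1, patches.size(-1)))
        return F.mse_loss(rec, target)                   # loss on masked patches only


if __name__ == "__main__":
    model = LAVIESketch()
    loss = model(torch.randn(2, 196, 768))               # dummy patchified images
    loss.backward()                                      # updates ViT + LoRA, not frozen LLM weights
    print(float(loss))

Running the snippet confirms that gradients reach the ViT, the LoRA adapters, and the decoder, while the frozen "LLM" weights stay untouched, which is the division of trainable parameters the abstract describes.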
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Weitong_ZHANG1
Submission Number: 8544