Adapted-Language ViT: Empowering Self-Supervised Vision Transformers with LLMs

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: representation learning, transformers, vision-language modeling
TL;DR: The paper proposes a method that combines self-supervised learning with adapted large language model blocks to enhance vision transformers, demonstrating strong performance and robustness across multiple benchmarks.
Abstract: Integrating Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. A fundamental challenge, however, is the inherent modality mismatch between the text-centric pretraining of LLMs and the vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable finetuning; consequently, prior approaches keep the LLM blocks frozen and learn only the vision components. To address these challenges, we introduce Adapted-Language Vision Transformers (ALViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. ALViT co-adapts a ViT backbone and an LLM fusion block by (1) pre-training the ViT with Masked Auto-Encoding (MAE) for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block under the same MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM block to interpret visual information effectively. Extensive experiments demonstrate that ALViT significantly improves performance across a range of downstream vision tasks, offering an effective and efficient way to harness LLM knowledge for visual understanding.
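The core training idea in the abstract (a frozen transformer weight adapted only through trainable low-rank LoRA factors, optimized with a masked-patch reconstruction objective) can be sketched as a minimal toy. Everything below is an illustrative assumption, not the paper's actual architecture: the "LLM block" is reduced to a single frozen linear layer, context mixing is a fixed random matrix standing in for attention, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: patches, embedding dim, LoRA rank (not from the paper)
n, d, r = 16, 8, 2
mask_ratio, lr, steps = 0.75, 0.05, 200

x = rng.normal(size=(n, d))                # patch embeddings (stand-in for ViT features)
mask = rng.random(n) < mask_ratio          # MAE-style random patch mask
x_in = np.where(mask[:, None], 0.0, x)     # masked patches are hidden from the model
M = rng.normal(size=(n, n)) / np.sqrt(n)   # fixed mixing matrix (stand-in for attention)

W = rng.normal(size=(d, d)) / np.sqrt(d)   # frozen "LLM block" weight: never updated
A = rng.normal(size=(d, r)) / np.sqrt(d)   # trainable LoRA factor
B = np.zeros((r, d))                       # zero-init, so the LoRA delta starts at 0

losses = []
for _ in range(steps):
    y = M @ x_in @ (W + A @ B)             # frozen weight plus low-rank adaptation
    g = np.zeros_like(y)
    g[mask] = 2 * (y - x)[mask] / (mask.sum() * d)  # MSE gradient on masked patches only
    dW_eff = (M @ x_in).T @ g              # gradient w.r.t. the effective weight
    gA, gB = dW_eff @ B.T, A.T @ dW_eff    # chain rule through the low-rank factors
    A -= lr * gA                           # only A and B are updated;
    B -= lr * gB                           # W itself stays frozen throughout
    err = (M @ x_in @ (W + A @ B) - x)[mask]
    losses.append(float(np.mean(err ** 2)))

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The masked reconstruction loss decreases even though the frozen weight `W` is never touched, which is the mechanism the abstract relies on: the MAE objective shapes only the low-rank adaptation (and, in the full method, the ViT backbone), leaving the pretrained LLM weights intact.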
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9258