VDEP: Establishing Equivalence Between Image and Text Tokens Through Autoregressive Pre-training in MLLMs

ACL ARR 2025 May Submission 940 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal large language models (MLLMs) often underutilize visual information, leading to imbalanced alignment and limited performance. Through theoretical analysis, we reveal that existing alignment objectives risk collapsing into a unimodal, text-only training process. To address this, we propose Visual Dynamic Embedding-guided Pretraining (VDEP), a hybrid autoregressive framework that supervises image-related hidden states via dynamic embeddings from an MLP appended to the visual encoder. VDEP integrates visual tokens into training without added architectural complexity, reframing alignment as an information recovery task focused on fine-grained visual semantics. Our model-agnostic method consistently outperforms strong baselines across 13 benchmarks, setting a new standard for large-scale vision-language alignment. Code and models are available at https://github.com/anonymous-gpu/VDEP_LLava_1.5.git.
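To make the hybrid objective concrete, the sketch below shows one way such a training loss could be wired: the usual next-token cross-entropy on text positions plus a regression term that asks the LM's hidden states at image-token positions to recover dynamic embeddings produced by an MLP on the visual encoder's features. This is a minimal illustration under stated assumptions; the function name, tensor shapes, loss weighting, and the choice to detach the supervision targets are all hypothetical, not the authors' implementation (see the linked repository for that).

```python
# Minimal PyTorch-style sketch of a hybrid autoregressive + visual-recovery objective
# in the spirit of VDEP. All names, shapes, and the weighting alpha are assumptions.
import torch
import torch.nn.functional as F


def hybrid_vdep_loss(text_logits, text_labels,
                     image_hidden_states, image_patch_features,
                     dynamic_mlp, alpha=1.0):
    """Combine text next-token prediction with supervision of image-token hidden states.

    text_logits:          (B, T_text, vocab) LM logits at text positions
    text_labels:          (B, T_text) next-token targets (-100 = ignored)
    image_hidden_states:  (B, T_img, D) LM hidden states at image-token positions
    image_patch_features: (B, T_img, D_v) visual-encoder patch features
    dynamic_mlp:          MLP appended to the visual encoder; its outputs serve as
                          the dynamic embeddings that supervise the hidden states
    """
    # Standard autoregressive loss on the text stream.
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )

    # Visual recovery term: image-related hidden states regress onto the
    # dynamic embeddings, framing alignment as recovery of visual semantics.
    with torch.no_grad():  # assumption: targets are treated as fixed for this term
        targets = dynamic_mlp(image_patch_features)
    visual_loss = F.mse_loss(image_hidden_states, targets)

    return lm_loss + alpha * visual_loss
```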
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: MLLM, Multimodal Alignment, Visual Language Model, Pre-training paradigm
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Theory
Languages Studied: English, Chinese
Submission Number: 940