MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

Published: 07 Jul 2025, Last Modified: 09 Jul 2025. KnowFM @ ACL 2025. License: CC BY 4.0
Keywords: Knowledge; Transfer Learning; Instruction Tuning; Multimodal LLM
TL;DR: We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models.
Abstract: We present a novel visual instruction tuning strategy that improves the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work has not sufficiently examined the importance of each modality during instruction tuning: it typically relies on a majority of vision-language data, keeps text-only data limited, and fixes the modality mixture. By incorporating diverse text-only data into the visual instruction tuning stage, we vary the amount of vision-language data in controlled experiments to investigate the role of each modality. Our comprehensive evaluation shows that this text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as little as half the total training tokens. We find that simply adding sufficiently diverse text-only data enables the transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
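To make the mixture idea concrete, here is a minimal Python sketch of how one might assemble a text-heavy instruction-tuning mixture from text-only and vision-language pools. The function name, the `text_ratio` parameter, and the 0.8 default are illustrative assumptions, not details taken from the paper.

```python
import random

def build_mixture(text_only, vision_language, text_ratio=0.8, seed=0):
    """Sample a training mixture with a given fraction of text-only examples.

    text_only / vision_language: lists of instruction-tuning examples.
    text_ratio: fraction of the mixture drawn from text-only data;
                a text-heavy recipe uses a high value (e.g., 0.8),
                a vision-heavy recipe a low one. Hypothetical parameter.
    """
    rng = random.Random(seed)
    total = len(text_only) + len(vision_language)
    n_text = int(total * text_ratio)
    n_vl = total - n_text
    # Cap each draw at the pool size so sampling never fails.
    mixture = (rng.sample(text_only, min(n_text, len(text_only)))
               + rng.sample(vision_language, min(n_vl, len(vision_language))))
    rng.shuffle(mixture)  # interleave modalities for training
    return mixture
```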
Archival Status: Archival (included in proceedings)
Submission Number: 35