Understanding the Design Space and Cross-Modality Transfer for Vision-Language Models

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multimodal, mixture-of-transformers, vision-language models, fusion, reinforcement learning, fine-tuning, transfer
Abstract: The training of multimodal models involves many design choices, such as the underlying modality-specific tokenizers, fusion mechanisms, and strategies for freezing model layers during different training stages. However, the individual impact of these decisions on downstream multimodal performance remains poorly understood due to the diversity of current practices. In this paper, we systematically investigate how choices in image tokenization, architectural design, and layer-freezing strategies affect the training and cross-modal generalization of vision-language models (VLMs). We explore a design space comprising six image tokenizers, three VLM architectural variants, and various parameter-freezing strategies. To further probe cross-modality transfer, we introduce three new synthetic datasets, which we use to evaluate our pretrained models. Our experiments reveal several key trends. (i) Image tokenizers trained with text-aware objectives are crucial for strong VLM performance, outperforming those trained without such objectives on both in-domain and out-of-domain tasks. (ii) Architectures that explicitly separate modalities, such as the Mixture-of-Transformers fusion architecture, along with training recipes that preserve the more general textual knowledge and reasoning of the base language model, generalize well to out-of-domain tasks. (iii) Cross-modality transfer depends heavily on representational alignment between text and images; in our synthetic setting, image-to-text transfer is comparatively strong, whereas text-to-image transfer is weak.
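To make the design dimensions concrete, below is a minimal, illustrative PyTorch sketch of a Mixture-of-Transformers-style block and of a layer-freezing strategy of the kind the abstract describes. This is an assumption about the general design, not the authors' implementation: the names (MoTBlock, modality_mask, _route), the choice of which parameters to duplicate per modality, and the convention that index 0 denotes text are all hypothetical.

```python
# Illustrative sketch (not the paper's code): self-attention is computed
# globally over the mixed token sequence, while projections, feed-forward
# networks, and layer norms are duplicated per modality and selected by a
# per-token modality mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):  # hypothetical name
    def __init__(self, d_model: int, n_heads: int, n_modalities: int = 2):
        super().__init__()
        self.n_heads = n_heads
        # One set of non-embedding parameters per modality (assumed: 0 = text, 1 = image).
        self.qkv = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities)])
        self.out = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])

    def _route(self, modules, x, mask, out_dim):
        # Apply the m-th module only to tokens whose modality id is m.
        out = x.new_zeros(*x.shape[:-1], out_dim)
        for m, module in enumerate(modules):
            sel = mask == m
            if sel.any():
                out[sel] = module(x[sel])
        return out

    def forward(self, x, modality_mask):
        # x: (batch, seq, d_model); modality_mask: (batch, seq) of modality ids.
        B, T, D = x.shape
        h = self._route(self.norm1, x, modality_mask, D)
        q, k, v = self._route(self.qkv, h, modality_mask, 3 * D).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # Global attention: every token attends to every token across modalities.
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        x = x + self._route(self.out, attn, modality_mask, D)
        x = x + self._route(self.ffn, self._route(self.norm2, x, modality_mask, D), modality_mask, D)
        return x


# One possible layer-freezing strategy from the design space: freeze the
# text-side experts to preserve the base language model's knowledge, and
# train only the image-side parameters (index 0 = text is an assumption).
block = MoTBlock(d_model=512, n_heads=8)
for experts in (block.qkv, block.out, block.ffn, block.norm1, block.norm2):
    for p in experts[0].parameters():
        p.requires_grad = False
```

The key design choice this sketch highlights is that modality separation happens in the parameters, not in the attention pattern: tokens of both modalities still interact through global attention, which is what allows cross-modality transfer while each modality keeps its own weights.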
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20919