Keywords: multimodal, mixture-of-transformers, vision-language models, fusion, reinforcement learning, fine-tuning, transfer
Abstract: The training of multimodal models involves many design choices, such as the underlying modality-specific tokenizers, fusion mechanisms, and strategies for freezing model layers during different training stages. However, the individual impact of these decisions on downstream multimodal performance remains poorly understood due to the diversity of current practices. In this paper, we systematically investigate how choices in image tokenization, architectural design, and layer-freezing strategies affect the training and cross-modal generalization of vision-language models (VLMs). We train and evaluate over 50 VLM variants across a controlled suite of tokenizers, model architectures, and training recipes. Our experiments reveal several key trends: (1) image tokenizers designed with text alignment in mind, together with training recipes that further enhance image-text alignment, yield the best performance; (2) unfreezing the language model boosts in-domain results but can degrade out-of-domain generalization; and (3) fusion mechanisms based on the mixture-of-transformers architecture are effective, especially when the language parameters are frozen. To further probe cross-modal transfer, we introduce three new synthetic datasets, which we use to evaluate our pretrained models.
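To make the third finding concrete, the sketch below illustrates (in simplified form, and not as the authors' implementation) a mixture-of-transformers style fusion layer: image and text tokens attend jointly through shared self-attention, but each modality is routed through its own feed-forward weights, so the text-side (language) parameters can be frozen while the image-side parameters are trained. The class name `MoTLayer` and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of a mixture-of-transformers style fusion layer (PyTorch).
# Assumptions: shared self-attention over the concatenated image+text sequence,
# per-modality feed-forward experts, and a boolean mask marking image tokens.
import torch
import torch.nn as nn

class MoTLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Self-attention shared across modalities (joint image+text sequence).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Modality-specific feed-forward experts.
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model),
                             nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in ("image", "text")
        })
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, image_mask):
        # x: (batch, seq, d_model); image_mask: (batch, seq), True for image tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self.norm2(x)
        # For clarity both experts run on all tokens and the mask selects the output;
        # a real implementation would dispatch tokens to one expert each.
        out = torch.where(image_mask.unsqueeze(-1),
                          self.ffn["image"](h),
                          self.ffn["text"](h))
        return x + out

# Freezing the language-side parameters while training the image expert,
# mirroring the "frozen language parameters" setting described in the abstract.
layer = MoTLayer()
for p in layer.ffn["text"].parameters():
    p.requires_grad = False
```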
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20919