Keywords: multimodal, mixture-of-transformers, vision-language models, fusion, reinforcement learning, fine-tuning, transfer
Abstract: The training of multimodal models involves many design choices, such as the underlying modality-specific tokenizers, fusion mechanisms, and strategies for freezing model layers during different training stages. However, the individual impact of these decisions on downstream multimodal performance remains poorly understood due to the diversity of current practices. In this paper, we systematically investigate how choices in image tokenization, architectural design, and layer-freezing strategies affect the training and cross-modal generalization of vision-language models (VLMs). We train and evaluate over 50 VLM variants across a controlled suite of tokenizers, model architectures, and training recipes. Our experiments reveal several key trends: (1) image tokenizers designed with text alignment in mind, together with training recipes that further enhance image-text alignment, yield the best performance; (2) unfreezing the language model boosts in-domain results but can degrade out-of-domain generalization; and (3) fusion mechanisms based on the mixture-of-transformers architecture are effective, especially when the language parameters are frozen. To further probe cross-modal transfer, we introduce three new synthetic datasets, which we use to evaluate our pretrained models.
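To make the third finding concrete, the sketch below illustrates (in simplified form, and not as the authors' implementation) a mixture-of-transformers style fusion layer: image and text tokens attend jointly through shared self-attention, but each modality is routed through its own feed-forward weights, so the text-side (language) parameters can be frozen while the image-side parameters are trained. The class name `MoTLayer` and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of a mixture-of-transformers style fusion layer (PyTorch).
# Assumptions: shared self-attention over the concatenated image+text sequence,
# per-modality feed-forward experts, and a boolean mask marking image tokens.
import torch
import torch.nn as nn

class MoTLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Self-attention shared across modalities (joint image+text sequence).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Modality-specific feed-forward experts.
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model),
                             nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in ("image", "text")
        })
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, image_mask):
        # x: (batch, seq, d_model); image_mask: (batch, seq), True for image tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self.norm2(x)
        # For clarity both experts run on all tokens and the mask selects the output;
        # a real implementation would dispatch tokens to one expert each.
        out = torch.where(image_mask.unsqueeze(-1),
                          self.ffn["image"](h),
                          self.ffn["text"](h))
        return x + out

# Freezing the language-side parameters while training the image expert,
# mirroring the "frozen language parameters" setting described in the abstract.
layer = MoTLayer()
for p in layer.ffn["text"].parameters():
    p.requires_grad = False
```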
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20919