Keywords: multilinguality, large language models, vision language models, multimodal models, image, video, cultural
TL;DR: We introduce TowerVision, a VLM supporting both image and video inputs, with improved multilingual capabilities explored through ablations on training data, base models, and vision encoders.
Abstract: Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks).
Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset.
Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization---both from high-resource to underrepresented languages and vice versa---and that instruction-tuned LLMs are not always the optimal initialization point.
To support further research, we publicly release all models, data, and training recipes.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25253