TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Workshop MM Intelligence Poster · CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: multilinguality, large language model, vision language models, multimodal models, image, video, cultural, cross-lingual generalization
TL;DR: We introduce TowerVision, a VLM that supports both images and video, with improved multilingual capabilities explored via ablations on training data, base models, and vision encoders
Abstract: Despite rapid progress in vision-language models (VLMs), most existing approaches remain English-centric, often relying on undisclosed training data or recipes, which limits their effectiveness and reproducibility in multilingual settings. In this work, we present a systematic empirical study of how to best incorporate multilinguality across training data, encoder choices, and language models. Our results show that high-quality multilingual vision-language data substantially improves cross-lingual generalization, enabling effective transfer in both directions: from high-resource to underrepresented languages and vice versa. We further find that initializing from language models with strong multilingual priors is often more effective than initializing from general-purpose language models. Guided by these findings, we design TowerVision, a family of open-source multilingual VLMs built on the multilingual text-only model Tower+. TowerVision-9B achieves competitive performance across a range of multimodal multilingual benchmarks, with particular strength in culturally grounded tasks and multimodal translation. Notably, our models outperform existing approaches trained on substantially larger datasets, as shown on ALM-Bench and Multi30K. Along with the models, we release VisionBlocks, a high-quality, curated vision-language dataset.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 35