Keywords: multilinguality, large language models, vision language models, multimodal models, image, video, cultural
TL;DR: We introduce TowerVision, a VLM supporting both image and video inputs, with improved multilingual capabilities explored through ablations on training data, base models, and vision encoders.
Abstract: Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks).
Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset.
Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization---both from high-resource to underrepresented languages and vice versa---and that instruction-tuned LLMs are not always the optimal initialization point.
To support further research, we publicly release all models, data, and training recipes.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25253