Abstract: Multimodal models have achieved state-of-the-art performance for English thanks to abundant high-quality multimodal data (image-text and audio-text). Performance in other languages, however, lags behind due to the limited availability of high-quality multilingual multimodal data. Current state-of-the-art methods rely on automatic translations to create and evaluate multilingual multimodal models. Meanwhile, the availability of multilingual text data and robust self-supervised methods has grown significantly, leading to powerful multilingual text models. In this work, we leverage the strong multilingual semantic alignment of these text models and align them with multimodal models. We demonstrate that learning just a few linear layers can transform multilingual text representations into multimodal text representations compatible with the rest of the multimodal model. Our method, M2M, uses only English text data to learn this transformation/alignment. It achieves 95.3\% Recall@10 for English (0.3\% higher than the baseline model) and 89.2\% Recall@10 averaged across 11 languages (10 of which are unseen during alignment) on the Text-to-Image retrieval task on the XTD dataset. M2M generalizes across architectures, datasets, modalities, and tasks (Image-Text retrieval, Audio-Text retrieval, and Cross-lingual Text-to-Image generation). Code, checkpoints, and data will be publicly released (https://github.com/m2m-acl25/M2M).
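As a rough illustration of the alignment idea described in the abstract, the sketch below assumes a frozen multilingual text encoder and a frozen multimodal (CLIP-style) text encoder; a small stack of linear layers is trained on English captions only, mapping multilingual embeddings into the multimodal text space. The module names, dimensions, and the MSE objective here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of linear-layer alignment (assumed dims, loss, and training loop; not the paper's exact setup).
import torch
import torch.nn as nn

class LinearAligner(nn.Module):
    """A few linear layers mapping multilingual text embeddings into the multimodal text space."""
    def __init__(self, in_dim=768, out_dim=512, hidden_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def train_step(aligner, optimizer, multilingual_emb, multimodal_emb):
    """One alignment step on English captions: pull the projected multilingual embeddings
    toward the frozen multimodal model's embeddings of the same captions."""
    optimizer.zero_grad()
    pred = aligner(multilingual_emb)
    loss = nn.functional.mse_loss(pred, multimodal_emb)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (with precomputed embeddings of the same English captions from both frozen encoders):
# aligner = LinearAligner()
# opt = torch.optim.Adam(aligner.parameters(), lr=1e-4)
# loss = train_step(aligner, opt, multilingual_text_emb_batch, multimodal_text_emb_batch)
```

At inference time, non-English text would pass through the multilingual encoder and the trained aligner, after which the rest of the multimodal model is used unchanged for retrieval or generation.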
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, multilingualism, image-text retrieval, audio-text, text-to-image generation, multilingual evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Arabic, Bengali, Czech, German, Greek, English, French, Gujarati, Hebrew, Hindi, Indonesian, Italian, Japanese, Kannada, Korean, Malayalam, Marathi, Dutch, Nepali, Punjabi, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Tamil, Telugu, Turkish, Ukrainian, Urdu, Vietnamese, Chinese (Simplified), Chinese (Traditional)
Submission Number: 2362