Parrot: Multilingual Visual Instruction Tuning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose Parrot, a novel approach that leverages textual guidance for visual token alignment at the language level. Parrot conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. Parrot achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: \url{https://github.com/AIDC-AI/Parrot}.
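The alignment mechanism described above can be illustrated with a minimal numerical sketch: text embeddings attend over visual tokens via cross-attention, the pooled text-conditioned context routes among language experts, and the gated mixture of expert outputs yields language-specific visual representations. All shapes, parameter names, and the mean-pooling step are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, E = 32, 4     # hidden dim, number of language experts (illustrative)
Nv, Nt = 9, 5    # visual / text token counts (illustrative)

visual = rng.standard_normal((Nv, D))   # initial visual features
text = rng.standard_normal((Nt, D))     # multilingual text embeddings

# Hypothetical projection and expert parameters (random for illustration).
Wq = rng.standard_normal((D, D)) / np.sqrt(D)
Wk = rng.standard_normal((D, D)) / np.sqrt(D)
W_router = rng.standard_normal((D, E)) / np.sqrt(D)
experts = rng.standard_normal((E, D, D)) / np.sqrt(D)

# 1. Cross-attention: text queries attend over visual keys.
attn = softmax((text @ Wq) @ (visual @ Wk).T / np.sqrt(D))  # (Nt, Nv)
context = (attn @ visual).mean(axis=0)  # (D,) pooled language-aware context

# 2. Route to language experts using the text-conditioned context.
gate = softmax(context @ W_router)  # (E,) soft expert weights, sum to 1

# 3. Mix expert outputs into language-specific visual tokens.
expert_out = np.einsum('vd,edf->evf', visual, experts)  # (E, Nv, D)
aligned = np.einsum('e,evf->vf', gate, expert_out)      # (Nv, D)

print(aligned.shape)  # (9, 32)
```

The sketch uses soft (dense) gating over all experts for simplicity; a practical MoE layer would typically select only the top-k most relevant experts, as the abstract's "select the most relevant experts" phrasing suggests.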
Lay Summary: Smart AI systems that can understand both images and text (like the technology behind GPT-4) are becoming very powerful. However, when we teach these systems to connect images with words, they often get worse at understanding languages other than English. This is usually because most of the learning materials are in English, making the AI biased. We've created a new method called Parrot. When Parrot looks at a picture, our system helps it understand that picture not just with English descriptions but also with guidance from many other languages. It's like having different language specialists inside the AI, ensuring it correctly links the image's content to words in various languages. Parrot significantly improves how well these AI models understand images and text in multiple languages. This is vital for creating AI tools that are fair and useful for people all over the world, no matter what language they speak. To demonstrate Parrot's skills, we also built a new challenging test with questions in six different languages.
Link To Code: https://github.com/AIDC-AI/Parrot
Primary Area: Deep Learning->Large Language Models
Keywords: Multimodal Large Language Models; Multilingual MLLM; Mixture-of-Experts
Submission Number: 5793