M$^2$-VLP: Enhancing Multilingual Vision-Language Pre-Training via Multi-Grained Alignment

Published: 29 Jan 2025, Last Modified: 29 Jan 2025
Venue: WWW 2025 Poster
License: CC BY 4.0
Track: Semantics and knowledge
Keywords: Multilingual vision-language pre-training, Multi-modal alignment, Cross-lingual transfer
TL;DR: We propose a Multi-grained Multilingual Vision-Language Pre-training model.
Abstract: Recently, multilingual Vision-Language Pre-training (mVLP) has made remarkable progress in learning joint representations across modalities and languages. However, most existing methods learn semantic alignment only at a coarse-grained level and fail to capture fine-grained correlations between languages and modalities. To address this, we propose a Multi-grained Multilingual Vision-Language Pre-training (M$^2$-VLP) model, which learns cross-lingual and cross-modal alignment at multiple levels of semantic granularity. For cross-lingual interaction, the model learns both the global alignment of parallel sentence pairs and word-level correlations. For cross-modal interaction, it aligns images with captions and image regions with their corresponding words. To integrate these cross-lingual and cross-modal alignments, we propose a unified multi-grained contrastive learning paradigm. Under zero-shot cross-lingual and fine-tuned multilingual settings, extensive experiments on vision-language downstream tasks across twenty languages demonstrate the effectiveness of M$^2$-VLP over competitive contrastive models.
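The abstract does not specify the loss formulations, so the sketch below is only an illustration of what a unified multi-grained contrastive objective of this kind could look like. It assumes standard symmetric InfoNCE for the coarse-grained terms (image–caption and parallel-sentence alignment) and a FILIP-style token/region max-similarity aggregation for the fine-grained terms (word–word and word–region alignment); the function names, pooling scheme, and equal weighting of the four terms are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-grained contrastive objective (not the paper's code).
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over global embeddings.

    a, b: [batch, dim] L2-normalised embeddings of paired items
    (e.g. image/caption pairs, or parallel sentences in two languages).
    """
    logits = a @ b.t() / temperature                      # [batch, batch] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def fine_grained_nce(x_tokens: torch.Tensor, y_tokens: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Fine-grained contrastive loss between token/region sequences.

    x_tokens: [batch, n_x, dim], y_tokens: [batch, n_y, dim], L2-normalised.
    Each token is matched to its most similar counterpart in the paired sequence,
    and the max-similarities are averaged into a pairwise score (a FILIP-style
    late-interaction stand-in for word-word and word-region alignment).
    """
    # Token-level similarities for every (example_i, example_j) pair in the batch.
    sim = torch.einsum('ixd,jyd->ijxy', x_tokens, y_tokens)   # [B, B, n_x, n_y]
    score_x2y = sim.max(dim=-1).values.mean(dim=-1)            # [B, B]
    score_y2x = sim.max(dim=-2).values.mean(dim=-1)            # [B, B]
    targets = torch.arange(x_tokens.size(0), device=x_tokens.device)
    return 0.5 * (F.cross_entropy(score_x2y / temperature, targets)
                  + F.cross_entropy(score_y2x / temperature, targets))


def multi_grained_loss(img_g, cap_g, src_g, tgt_g,
                       img_regions, cap_tokens, src_tokens, tgt_tokens):
    """Combine the four alignment terms described in the abstract.

    All inputs are assumed L2-normalised; equal weighting is an assumption.
    """
    return (info_nce(img_g, cap_g)                        # image <-> caption (coarse, cross-modal)
            + info_nce(src_g, tgt_g)                      # parallel sentences (coarse, cross-lingual)
            + fine_grained_nce(img_regions, cap_tokens)   # regions <-> words (fine, cross-modal)
            + fine_grained_nce(src_tokens, tgt_tokens))   # words <-> words (fine, cross-lingual)
```

As a quick sanity check, the objective can be exercised with random L2-normalised tensors, e.g. `F.normalize(torch.randn(8, 256), dim=-1)` for the global embeddings and `F.normalize(torch.randn(8, 36, 256), dim=-1)` for region/token sequences; in a real setup these would come from the image and multilingual text encoders.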
Submission Number: 461