Visual Pivoting Unsupervised Multimodal Machine Translation in Low-Resource Distant Language Pairs

Published: 2024, Last Modified: 22 Jan 2026, EMNLP (Findings) 2024, CC BY-SA 4.0
Abstract: Unsupervised multimodal machine translation (UMMT) aims to leverage visual information as a pivot between two languages to achieve better performance on low-resource language pairs. However, one challenge remains open: how to handle alignment between distant language pairs (DLPs) in UMMT. To this end, this paper proposes a visual pivoting UMMT method for DLPs. Specifically, we first construct a dataset containing two DLPs, English-Uyghur and Chinese-Uyghur. We then apply visual pivoting to both pre-training and fine-tuning, and we observe that images at the encoder and decoder of UMMT have a noticeable effect on DLPs. Finally, we introduce informative multi-granularity image features to further align the latent spaces of the two languages. Experimental results show that the proposed method significantly outperforms several baselines on both DLPs and close language pairs (CLPs).