Beyond Language: Empowering Unsupervised Machine Translation with Cross-modal Alignment

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Neural machine translation; Unsupervised machine translation; Multi-modal machine translation; Cross-modal alignment
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Unsupervised machine translation (UMT) has achieved notable performance in recent years without relying on any parallel corpora. Nevertheless, aligning the source and target languages in the latent space remains a challenge for UMT. Although different languages vary in their textual representations, they often share a common visual description. Inspired by this observation, we propose a novel unsupervised multi-modal machine translation method that uses images as pivots to align different languages. Specifically, we introduce cross-modal contrastive learning to achieve both sentence-level and token-level alignment. By leveraging monolingual image-text pairs, we map the source and target languages into a shared semantic space with images as intermediaries, thereby achieving source-to-target alignment. Experimental results demonstrate that our approach effectively learns the source-to-target alignment from monolingual data alone and achieves significant improvements over state-of-the-art methods.
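
Illustration: the abstract does not spell out the training objective, so the following is only a minimal sketch of sentence-level cross-modal contrastive alignment, assuming a standard CLIP-style symmetric InfoNCE loss over monolingual image-text pairs; the function name, temperature value, and use of in-batch negatives are illustrative assumptions, not the paper's exact formulation, and the token-level alignment term is omitted.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 image_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # text_emb, image_emb: (batch, dim) sentence and image embeddings
    # from monolingual image-text pairs (source or target language).
    # Normalize so the dot product is cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positive pairs,
    # all other entries serve as in-batch negatives.
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric objective: text-to-image and image-to-text retrieval.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2

Applying this loss separately to source-language and target-language image-text pairs pulls both languages toward the same image embeddings, so the shared visual space acts as the pivot that indirectly aligns source with target.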
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3326