DAS-CL: Towards Multimodal Machine Translation via Dual-Level Asymmetric Contrastive Learning

Published: 21 Oct 2023 · Last Modified: 13 Apr 2024 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Multimodal machine translation (MMT) aims to exploit visual information to improve neural machine translation (NMT). It has been demonstrated that image captioning and object detection can further improve MMT. In this paper, to leverage image captioning and object detection more effectively, we propose a Dual-level ASymmetric Contrastive Learning (DAS-CL) framework. Specifically, we leverage image captioning and object detection to generate additional pairs of visual and textual inputs. At the utterance level, we introduce an image captioning model to generate coarse-grained pairs. At the word level, we introduce an object detection model to generate fine-grained pairs. To mitigate the negative impact of noise in the generated pairs, we apply asymmetric contrastive learning at both levels. Experiments on three translation directions of the Multi30K dataset demonstrate that DAS-CL significantly outperforms existing MMT frameworks and achieves new state-of-the-art performance. More encouragingly, further analysis shows that DAS-CL is more robust to irrelevant visual information.
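To make the contrastive-learning component concrete, the sketch below shows a standard InfoNCE-style loss over a batch of paired text/visual embeddings, where matched pairs are positives and all other in-batch pairs are negatives. This is a generic illustration, not the paper's exact objective: the function name, the NumPy implementation, the temperature value, and the choice to score only the text-to-image direction (one plausible reading of "asymmetric", since the paper's precise formulation is not given in the abstract) are all assumptions.

```python
import numpy as np

def info_nce_loss(text_emb, vis_emb, temperature=0.1):
    """InfoNCE-style contrastive loss for paired text/visual embeddings.

    Rows with the same index are positive pairs; every other row in the
    batch serves as a negative. Only the text-to-image direction is
    scored here (an assumed, asymmetric variant); symmetric versions
    average both directions.
    """
    # L2-normalize so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature                 # (B, B) similarity matrix
    # Softmax cross-entropy with the diagonal as the target class.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))

# Toy check: well-aligned pairs should incur a lower loss than shuffled ones.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
vis = text + 0.01 * rng.normal(size=(4, 8))   # near-perfect alignment
print(info_nce_loss(text, vis) < info_nce_loss(text, np.roll(vis, 1, axis=0)))
```

At the utterance level such a loss would operate on caption/image embeddings, and at the word level on object-region/word embeddings; the asymmetry is meant to keep noisy generated pairs from dominating the objective.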