Progressive Alignment-aware Multimodal Fusion with an Easy2hard Strategy for Multimodal Neural Machine Translation
Keywords: Multimodal neural machine translation, Multi-modal alignment, Easy2hard, Progressive multi-modal fusion, Multi30K
Abstract: Multimodal neural machine translation (MNMT) aims to improve text-level machine translation performance by exploiting text-related images. Most previous work on MNMT focuses either on multimodal feature fusion or on handling noisy multimodal representations built from full visual and textual features; however, the degree of multimodal alignment is often ignored. In general, fine-grained multimodal information, such as visual objects and textual entities, is easy to align, whereas global-level semantic alignment is difficult. To alleviate this challenging alignment problem, this paper proposes a novel progressive multimodal fusion approach with an easy-to-hard (easy2hard) cross-modal alignment strategy that fully exploits visual information for MNMT. We first extract visual and textual features with modality-specific pre-trained models, and roughly align the fine-grained features (e.g., regional visual features and entity features) as multimodal anchors through a cross-modal interaction module. A progressive multimodal fusion framework then gradually narrows the global-level multimodal semantic gap based on these roughly aligned anchors. We validate our method on the Multi30K dataset. The experimental results show the superiority of the proposed model, which achieves state-of-the-art (SOTA) scores on the En-De, En-Fr, and En-Cs translation tasks.
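Since the submission provides no public code (per the anonymity statement below), the following is a minimal PyTorch sketch of the two stages the abstract describes: rough anchor alignment of fine-grained features via cross-attention, followed by gated, layer-wise fusion that injects visual information into the textual representation step by step. All class names, dimensions, and the specific gating mechanism are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of anchor alignment + progressive fusion; details assumed.
import torch
import torch.nn as nn

class CrossModalAnchorAligner(nn.Module):
    """Roughly aligns fine-grained features (regional visual features and
    textual entity features) via cross-attention to produce multimodal anchors."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, entity_feats, region_feats):
        # entity_feats: (B, N_e, d); region_feats: (B, N_r, d)
        anchors, _ = self.cross_attn(entity_feats, region_feats, region_feats)
        return self.norm(anchors + entity_feats)  # residual keeps textual grounding

class ProgressiveFusion(nn.Module):
    """Gradually narrows the global semantic gap: each layer attends from the
    textual representation to the anchors and blends the result through a
    learned gate, so visual information is injected step by step."""
    def __init__(self, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.gates = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(n_layers))

    def forward(self, text_feats, anchors):
        h = text_feats
        for attn, gate in zip(self.layers, self.gates):
            fused, _ = attn(h, anchors, anchors)
            g = torch.sigmoid(gate(torch.cat([h, fused], dim=-1)))
            h = g * fused + (1 - g) * h  # gated residual: inject vision gradually
        return h

if __name__ == "__main__":
    B, N_t, N_e, N_r, d = 2, 20, 5, 36, 512
    aligner, fusion = CrossModalAnchorAligner(d), ProgressiveFusion(d)
    anchors = aligner(torch.randn(B, N_e, d), torch.randn(B, N_r, d))
    out = fusion(torch.randn(B, N_t, d), anchors)
    print(out.shape)  # torch.Size([2, 20, 512])
```

The gated residual update is one plausible reading of "progressive" fusion: the text stream is never replaced outright, so early layers can pass through nearly unchanged while later layers absorb more visual evidence.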
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)