Abstract: To reduce ambiguity and semantic distortion when translating text, current dominant methods focus on integrating features from multiple modalities, such as text and image. However, this indiscriminate integration neglects the inherent differences between modalities, introducing noise that adversely affects translation. To overcome this challenge, we propose FINE-LMT, a model that learns common and specific features from the modalities. To recognize the features common to the text and image modalities, we employ contrastive learning, which sharpens the distinction between common and specific features. Additionally, when extracting specific features from the text modality, we apply an orthogonal loss to keep them clearly separated from the extracted common features. By fusing common and specific features, FINE-LMT surpasses advanced MMT methods and integrates effectively with pre-trained language models, achieving BLEU score improvements of 0.98% and 1.06% on the En\(\rightarrow \)De and En\(\rightarrow \)Fr translation tasks, and of 0.61% and 0.78% when integrated with pre-trained models, all averaged across three benchmarks.
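The orthogonality constraint mentioned above can be illustrated with a minimal sketch: one common way to enforce it is to penalize the inner product between each sample's common and specific feature vectors, driving the two representations toward orthogonal directions. The function name and the exact per-sample squared-dot-product formulation below are illustrative assumptions, not necessarily the paper's precise loss.

```python
import numpy as np

def orthogonal_loss(common, specific):
    """Penalize overlap between common and specific features.

    common, specific: (batch, dim) arrays of feature vectors.
    Returns the sum over the batch of squared inner products,
    which is zero exactly when each sample's common and specific
    vectors are orthogonal. (Illustrative sketch only; the paper's
    exact formulation may differ.)
    """
    per_sample_dots = np.sum(common * specific, axis=1)
    return float(np.sum(per_sample_dots ** 2))

# Orthogonal feature pairs incur no penalty:
c = np.array([[1.0, 0.0]])
s = np.array([[0.0, 1.0]])
print(orthogonal_loss(c, s))  # → 0.0

# Overlapping features are penalized:
print(orthogonal_loss(np.array([[1.0, 1.0]]), np.array([[1.0, 0.0]])))  # → 1.0
```

During training, such a term would be added to the translation objective so that the specific-feature extractor cannot simply re-encode information already captured by the common features.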