Abstract: To reduce ambiguity and semantic distortion when translating text, current dominant methods focus on integrating features from multiple modalities, such as text and image. However, this indiscriminate integration neglects the inherent differences between modalities, introducing noise that adversely affects translation. To overcome this challenge, we propose FINE-LMT, a model that learns common and specific features from the modalities. To recognize the features common to the text and image modalities, we employ contrastive learning, which sharpens the distinction between common and specific features. Additionally, when extracting specific features from the text modality, we apply an orthogonal loss to keep them clearly separated from the extracted common features. By fusing common and specific features, FINE-LMT surpasses advanced MMT methods and integrates effectively with pre-trained language models, achieving BLEU score improvements of 0.98% and 1.06% on the En\(\rightarrow \)De and En\(\rightarrow \)Fr translation tasks, and of 0.61% and 0.78% when integrated with pre-trained models, all averaged across three benchmarks.
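The orthogonality constraint mentioned above can be illustrated with a minimal sketch: one common way to enforce it is to penalize the inner product between each sample's common and specific feature vectors, driving the two representations toward orthogonal directions. The function name and the exact per-sample squared-dot-product formulation below are illustrative assumptions, not necessarily the paper's precise loss.

```python
import numpy as np

def orthogonal_loss(common, specific):
    """Penalize overlap between common and specific features.

    common, specific: (batch, dim) arrays of feature vectors.
    Returns the sum over the batch of squared inner products,
    which is zero exactly when each sample's common and specific
    vectors are orthogonal. (Illustrative sketch only; the paper's
    exact formulation may differ.)
    """
    per_sample_dots = np.sum(common * specific, axis=1)
    return float(np.sum(per_sample_dots ** 2))

# Orthogonal feature pairs incur no penalty:
c = np.array([[1.0, 0.0]])
s = np.array([[0.0, 1.0]])
print(orthogonal_loss(c, s))  # → 0.0

# Overlapping features are penalized:
print(orthogonal_loss(np.array([[1.0, 1.0]]), np.array([[1.0, 0.0]])))  # → 1.0
```

During training, such a term would be added to the translation objective so that the specific-feature extractor cannot simply re-encode information already captured by the common features.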