VisDNMT: Improving Neural Machine Translation via Visual Knowledge Distillation

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: VisDNMT distills visual knowledge from a pre-trained multilingual vision-language model to improve text-only translation without requiring paired images.
Abstract: Multi-modal machine translation (MMT) is the research field that aims to improve neural machine translation (NMT) models with visual knowledge. While existing MMT systems achieve promising performance over text-only NMT methods, they typically require paired text and images as input, which limits their applicability to general translation tasks. To benefit general translation with visual knowledge, we propose VisDNMT, which distills visual knowledge from a pre-trained multilingual vision-language model to aid translation. In particular, we train a Transformer-based model jointly with a standard cross-entropy loss for translation and a knowledge distillation (KD) objective that aligns its language embeddings with the vision-contextualized language embeddings of the teacher model. Compared to state-of-the-art methods, VisDNMT achieves consistently larger gains over text-only NMT baselines on both richly and sparsely visually grounded text.
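A minimal sketch of the joint objective described in the abstract (not the authors' implementation; the `student` interface, the `kd_projection` layer, the MSE distance, and the `kd_weight` hyperparameter are assumptions made for illustration):

```python
# Sketch of a joint training objective: standard translation cross-entropy
# plus a KD term that aligns student encoder states with frozen,
# vision-contextualized teacher embeddings. Interface names are hypothetical.
import torch
import torch.nn.functional as F


def joint_loss(student, src_tokens, tgt_tokens, teacher_embeddings, kd_weight=1.0):
    """Cross-entropy translation loss plus an embedding-alignment KD loss."""
    # Standard NMT forward pass: logits over the target vocabulary and the
    # encoder's contextual token representations (hypothetical interface).
    logits, encoder_states = student(src_tokens, tgt_tokens)

    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        tgt_tokens.view(-1),
        ignore_index=student.pad_id,  # ignore padding positions
    )

    # Project student representations into the teacher's embedding space and
    # align them with the frozen vision-contextualized teacher embeddings.
    projected = student.kd_projection(encoder_states)
    kd = F.mse_loss(projected, teacher_embeddings.detach())

    return ce + kd_weight * kd
```

The KD term lets the text-only student absorb visual grounding from the teacher at training time, so no images are needed at inference; the actual distance function and weighting used in the paper may differ from this sketch.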
Paper Type: short
Research Area: Machine Translation
Languages Studied: English, German, French
0 Replies

