Video-guided Multimodal Machine Translation: A Survey of Models, Datasets, and Challenges

ACL ARR 2025 February Submission 8007 Authors

16 Feb 2025 (modified: 09 May 2025)
License: CC BY 4.0
Abstract: In recent years, machine translation has evolved through the integration of multimodal information. Infusing additional modalities into translation tasks aids disambiguation and improves translation quality. Common modalities include images, speech, and video, which provide additional context alongside the text to be translated. While multimodal translation with images has been extensively studied, video-guided machine translation (VMT) has gained increasing attention, particularly since Wang et al. (2019) first explored the task. In this paper, we provide a comprehensive overview of VMT, highlighting its unique challenges, methodologies, and recent advancements. Unlike previous surveys, which primarily focus on image-based multimodal translation, this work explores the distinct complexities and opportunities introduced by video as a modality.
Paper Type: Short
Research Area: Machine Translation
Research Area Keywords: Multimodality, Machine Translation, Video-guided Machine Translation
Contribution Types: Surveys
Languages Studied: English, Chinese
Submission Number: 8007