CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

ACL ARR 2025 May Submission865 Authors

15 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Cultural content poses challenges for machine translation systems due to differences in conceptualization between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples, each consisting of an image paired with parallel captions in English and a regional language. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.
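To make the benchmark structure and the two evaluation conditions concrete, the sketch below illustrates one plausible way to represent a CaMMT triple and build the text-only vs. text+image prompts. All names (`CaMMTTriple`, `build_prompts`, the field names) and the translation direction (regional language into English) are illustrative assumptions, not the released schema or the authors' evaluation code.

```python
from dataclasses import dataclass

@dataclass
class CaMMTTriple:
    """One benchmark item: an image with parallel captions.
    Field names are hypothetical; the released dataset may differ."""
    image_path: str
    english_caption: str      # English reference caption
    regional_caption: str     # parallel caption in the regional language
    language: str             # e.g. "Swahili"

def build_prompts(item: CaMMTTriple) -> dict:
    """Construct the two evaluation conditions described in the paper:
    text-only translation vs. translation with the image as visual context.
    Direction (regional -> English) is an assumption for illustration."""
    instruction = (
        f"Translate the following {item.language} caption into English:\n"
        f"{item.regional_caption}"
    )
    return {
        "text_only": {"text": instruction},                          # no visual context
        "text_image": {"text": instruction, "image": item.image_path},
    }
```

Under this framing, a VLM's output in each condition would be scored against `english_caption` with the paper's automatic metrics and human judgments.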
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: machine translation, multimodality, multimodal applications, multilingual MT, multilingual corpora
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Amharic, Arabic, Bengali, Bulgarian, Chinese, Filipino, Igbo, Indonesian, Japanese, Korean, Malay, Marathi, Oromo, Portuguese, Russian, Spanish, Swahili, Tamil, Urdu, English
Submission Number: 865