Quantifying the Gap Between Machine Translation and Native Language in Training for Multimodal, Multilingual Retrieval

ACL ARR 2024 June Submission404 Authors

10 Jun 2024 (modified: 12 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Multilingual vision-language models that properly account for the perceptual differences reflected in image captions across languages and cultures are scarce. This lack of model flexibility manifests as a performance gap in German text-image retrieval between training on independently written German captions and training on English captions. In this work, we first show that off-the-shelf machine translation is ineffective at bridging this gap. Second, we propose techniques that reduce the drop-off relative to training on native German captions. Third, we show that part of the gap remains, identifying an open problem on which we encourage future work from the community.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, cross-modal pretraining
Contribution Types: Data analysis
Languages Studied: English, German
Submission Number: 404