An image speaks a thousand words, but can everyone listen? On translating images for cultural relevance
Abstract: Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities, such as images, to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: (i) concept, comprising 600 images that are cross-culturally coherent, focusing on a single concept per image; and (ii) application, comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess cultural relevance and meaning preservation. We find that, as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Even the best models translate only 6% of images for some countries in the easier concept dataset, and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data are released here: https://anonymous.4open.science/r/image-translation-6980.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Preprint Status: We plan to release a non-anonymous preprint in the next two months (i.e., during the reviewing process).
A1: yes
A1 Elaboration For Yes Or No: Section 7
A2: yes
A2 Elaboration For Yes Or No: Section 8
A3: yes
A3 Elaboration For Yes Or No: Section 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: We curate application-based data from math worksheets and stories. Links are provided in Section 3.2.
B2: yes
B2 Elaboration For Yes Or No: We discuss this in the last point of Section 8
B3: yes
B3 Elaboration For Yes Or No: We discuss this in the last point of Section 8
B4: yes
B4 Elaboration For Yes Or No: No personally identifiable information has been collected
B5: yes
B5 Elaboration For Yes Or No: Section 3
B6: yes
B6 Elaboration For Yes Or No: Section 3
C: yes
C1: yes
C1 Elaboration For Yes Or No: Section 2
C2: yes
C2 Elaboration For Yes Or No: Section 2
C3: yes
C3 Elaboration For Yes Or No: We work with image generation and report human evaluation ratings in Section 4
C4: n/a
D: yes
D1: yes
D1 Elaboration For Yes Or No: Appendix B
D2: yes
D2 Elaboration For Yes Or No: Appendix B
D3: yes
D3 Elaboration For Yes Or No: Appendix B
D4: yes
D4 Elaboration For Yes Or No: Appendix B
D5: yes
D5 Elaboration For Yes Or No: Section 4; we recruit 2 participants per country for the human evaluation.
E: no
E1: n/a