Machine Translation of Cooking Videos Using Descriptions of the Images by Chain-of-Thought Augmentation

AACL-IJCNLP 2025 Workshop SRW ARR Commitment Submission · 4 Authors

28 Oct 2025 (modified: 31 Oct 2025) · Submitted to IJCNLP-AACL 2025 SRW (ARR Commitment) · CC BY 4.0
Keywords: Machine Translation, Multi-Modal
Submission Category: Short Paper
TLDR: We adopt a Chain-of-Thought Augmentation (CoTA) approach, where the model generates descriptions of images and utilizes them as auxiliary information for the translation task.
Abstract: English cooking videos often contain polysemous words and omitted expressions, making accurate translation challenging. This study aims to improve English-Japanese machine translation of cooking videos by utilizing images extracted from the video. We adopt a Chain-of-Thought Augmentation (CoTA) approach, where the model generates descriptions of images and utilizes them as auxiliary information for the translation task. In our experiments, we selected sentences from an English-Japanese cooking video corpus that were difficult to translate due to polysemous words. We evaluated the performance of GPT-4o and Qwen2-VL using COMET and BLEU scores. The results demonstrate that incorporating images improves translation accuracy, with CoTA applied to GPT-4o yielding the largest gains.
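The two-step CoTA flow described in the abstract (first generate a description of the video frame, then translate with that description as auxiliary context) can be sketched as prompt construction. This is a minimal illustration: the function names, prompt wording, and placeholder frame reference are assumptions, not the authors' actual prompts.

```python
# Hedged sketch of the CoTA prompting flow from the abstract.
# Prompt wording and function names are illustrative assumptions.

def build_description_prompt(image_ref: str) -> str:
    """Step 1: ask the vision-language model to describe the video frame."""
    return (
        f"Describe the cooking scene in the image {image_ref} "
        "in one or two English sentences."
    )

def build_translation_prompt(source_sentence: str, image_description: str) -> str:
    """Step 2: translate, using the generated description as auxiliary
    context to resolve polysemous words and omitted expressions."""
    return (
        "Image description: " + image_description + "\n"
        "Using the description to resolve polysemous words and omitted "
        "expressions, translate the following English sentence into Japanese:\n"
        + source_sentence
    )

# Example: "beat" is polysemous; the frame description disambiguates it.
desc_prompt = build_description_prompt("<frame_0042>")
trans_prompt = build_translation_prompt(
    "Beat it until smooth.",
    "A cook whisks eggs in a bowl.",
)
```

In an actual pipeline, `desc_prompt` would be sent to a vision-language model (e.g. GPT-4o or Qwen2-VL) together with the image, and its response would be substituted for the hard-coded description in the second step.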
Student Status Proof: pdf
Paper Link: https://openreview.net/forum?id=LDlBULBUmX
Submission Number: 4