Retrieval-Augmented Machine Translation with Unstructured Knowledge

ACL ARR 2024 December Submission 137 Authors

09 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0

Abstract: Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance MT models. However, a large amount of world knowledge is organized in unstructured documents and might not be fully paired across languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. In addition, documents in different languages are provided to supply knowledge for these samples. Based on RAGtrans, we further propose a multi-task training method that teaches LLMs how to use information from multilingual documents during translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores. We also summarize the critical difficulties that current LLMs face with this task.

Paper Type: Long

Research Area: Machine Translation

Research Area Keywords: machine translation, multilingualism, multilingual benchmarks, multilingual evaluation

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Data resources, Data analysis

Languages Studied: English, Chinese

Submission Number: 137
