Sourcing trustworthy documents for training contextual machine translation systems

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: Web-crawled parallel text is not a good source of document annotations; use back-translated monolingual data instead
Abstract: Although document context is known to be vital for resolving a range of translation ambiguities, most machine translation systems continue to be trained and to operate at the sentence level. A common explanation is the lack of document-level annotations for existing training data. In this paper, we investigate whether such annotations would be helpful, even knowing that much of the bitext mined from the web may have been translated poorly by humans or by (sentence-level) MT. Working with large-scale parallel and monolingual data sets that we produced in-house, we build large-scale contextual MT systems into German, French, and Russian. We find that contextual MT systems benefit most when document samples are constructed only from high-quality back-translated monolingual data. We also show that these improvements are visible only when the systems are evaluated on their generative ability on dense test sets, rather than on contrastive discrimination between good and bad examples. The results confirm our suspicion that bitext crawled from the web may be of too low a quality to reliably preserve the contextual cues needed for training MT.
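To make the data-construction idea in the abstract concrete, the sketch below shows one way document-level training samples could be assembled from back-translated monolingual data: whole target-language documents are back-translated sentence by sentence, and each training example carries a window of preceding synthetic source sentences as context. This is a minimal illustration under assumed interfaces (the `bt_model.translate` method, the `<sep>` separator, and the `context_size` parameter are hypothetical), not the authors' actual pipeline.

```python
from typing import Iterable, Iterator, Tuple


def build_contextual_samples(
    target_docs: Iterable[list[str]],  # monolingual target-side documents, one list of sentences each
    bt_model,                          # any target->source sentence-level MT model (assumed interface)
    context_size: int = 2,             # number of preceding synthetic source sentences used as context
) -> Iterator[Tuple[str, str]]:
    """Yield (source_with_context, target) pairs whose document context
    comes entirely from back-translated monolingual data."""
    for doc in target_docs:
        # Back-translate every sentence so the synthetic source side stays
        # aligned with the original target-language document.
        synthetic_src = [bt_model.translate(sent) for sent in doc]
        for i, tgt_sent in enumerate(doc):
            ctx = synthetic_src[max(0, i - context_size):i]
            # Prepend the context window to the current synthetic source
            # sentence, joined with a separator token.
            src_with_ctx = " <sep> ".join(ctx + [synthetic_src[i]])
            yield src_with_ctx, tgt_sent
```

Because the context sentences are drawn from genuine monolingual documents rather than web-mined bitext, the discourse cues (coreference, consistency of terminology, formality) on the target side are preserved by construction, which is the property the abstract argues web-crawled parallel text often lacks.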
Paper Type: long
Research Area: Machine Translation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, French, German, and Russian