Generative evaluation for contextual machine translation

ACL ARR 2024 June Submission 1605 Authors

14 Jun 2024 (modified: 14 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Despite the fact that context is known to be vital for resolving a range of translation ambiguities, most traditional machine translation systems continue to be trained and to operate at the sentence level. This limitation imposes an inherent performance ceiling that is increasingly glaring next to their natively-contextual LLM counterparts. A common explanation is the lack of document-level annotations for existing training data. This work investigates whether having such annotations would be helpful for training traditional MT systems at scale. Working with a private parallel and monolingual data set, we build large-scale, state-of-the-art contextual MT systems into German, French, and Russian. We find that these systems are harmed when including contextual training examples sourced from mined parallel bitext, but that they otherwise improve over sentence-level baselines. We also show that these improvements are invisible when using contrastive, score-based test sets; instead, models must be tested directly on their ability to generate correct outputs, or with standard metrics on discourse-dense test sets. This provides evidence that mined parallel bitext does not contain reliable contextual signals, perhaps because it was translated in a sentence-level manner. Where possible, we repeat our results on public data.
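To make the evaluation distinction in the abstract concrete, here is a minimal sketch of the two protocols. It is illustrative only: the `model` object (with assumed `score` and `translate` methods), the example fields, and the `is_correct` checker are hypothetical stand-ins, not the paper's code or any particular library's API.

```python
# Illustrative sketch: contrastive (score-based) vs. generative evaluation
# for contextual MT. All names below are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ContrastiveExample:
    context: str    # preceding sentences supplied as document context
    source: str     # ambiguous source sentence
    reference: str  # contextually correct translation
    contrast: str   # minimally different, contextually wrong translation

def contrastive_accuracy(model, examples: List[ContrastiveExample]) -> float:
    """Score-based: the model only has to assign a higher score to the
    correct candidate than to the contrast; it never generates text."""
    hits = 0
    for ex in examples:
        good = model.score(ex.source, ex.reference, context=ex.context)
        bad = model.score(ex.source, ex.contrast, context=ex.context)
        hits += int(good > bad)
    return hits / len(examples)

def generative_accuracy(model, examples: List[ContrastiveExample],
                        is_correct: Callable[[str, ContrastiveExample], bool]) -> float:
    """Generation-based: the model must actually produce a translation,
    which a checker then inspects (e.g., for the disambiguated pronoun)."""
    hits = 0
    for ex in examples:
        hypothesis = model.translate(ex.source, context=ex.context)
        hits += int(is_correct(hypothesis, ex))
    return hits / len(examples)
```

The gap the abstract describes can arise because a model may rank the reference above the contrast without being able to produce the context-correct form itself; only the generative protocol, or standard metrics on discourse-dense test sets, exposes that difference.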
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: contextual mt, evaluation, contextual evaluation, data quality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English, French, German, Russian
Submission Number: 1605