Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

ACL ARR 2026 January Submission 8770 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: automatic evaluation, biases, domain adaptation, human evaluation
Abstract: Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator **C**ross-**D**omain **E**rror-**S**pan-**A**nnotation dataset (CD-ESA), of 4.5k human error span annotations from the same five annotators and the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain-shifts at the segment level (0.47–0.61 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78–0.83 vs. 0.96). Meaningful claims of cross-domain robustness require comparison to inter-annotator agreement and we recommend our standard evaluation setup for future evals.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: automatic evaluation, biases, domain adaptation, human evaluation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources
Languages Studied: English, German
Submission Number: 8770