Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation

Samuel Läubli, Rico Sennrich, Martin Volk

19 Jun 2018 (modified: 25 Sept 2018) | OpenReview Anonymous Preprint Blind Submission | Readers: Everyone

Abstract: Recent research suggests that neural machine translation achieves parity with professional human translation on the WMT Chinese–English news translation task. We empirically test this claim with alternative evaluation protocols, contrasting the evaluation of single sentences and entire documents. In a pairwise ranking experiment, human raters assessing adequacy and fluency show a stronger preference for human over machine translation when evaluating documents than when evaluating isolated sentences. Our findings emphasise the need to shift towards document-level evaluation as machine translation improves to the degree that errors which are hard or impossible to spot at the sentence level become decisive in discriminating the quality of different translation outputs.

Keywords: machine translation, evaluation

TL;DR: In terms of adequacy, raters prefer human over machine translation when evaluating entire documents, but not when evaluating single sentences.
