Digging Errors in NMT: Evaluating and Understanding Model Errors from Hypothesis Distribution

Anonymous

17 Sept 2021 (modified: 05 May 2023) · ACL ARR 2021 September Blind Submission
Abstract: Sound evaluation of a neural machine translation (NMT) model is key to its understanding and improvement. Current evaluation of an NMT system is usually built upon a heuristic decoding algorithm (e.g., beam search) and an evaluation metric that assesses similarity between the translation and a gold reference (e.g., BLEU). However, this system-level evaluation framework is limited in that it evaluates only a single best hypothesis and is subject to the search errors introduced by heuristic decoding algorithms. To better understand NMT models, we propose a novel evaluation protocol that defines model errors with respect to the hypothesis distribution. In particular, we first propose an exact top-$k$ decoding algorithm, which finds the top-ranked hypotheses in the whole hypothesis space and thus avoids search errors. We then evaluate NMT model errors with the distance between the hypothesis distribution and the ideal distribution, aiming for a comprehensive interpretation. We apply our evaluation to various NMT benchmarks and model architectures to provide an in-depth understanding of how NMT models work. We show that state-of-the-art Transformer models suffer from serious ranking errors and do not even outperform random chance. We further report several interesting findings on data-augmentation techniques, dropout, and deep/wide models. Additionally, we analyze beam search's lucky biases and regularization terms. Interestingly, we find these lucky biases decrease as model capacity increases.
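To make the idea of exact top-$k$ decoding concrete, below is a minimal Python sketch of one standard way to realize it: best-first (Dijkstra-style) search over partial hypotheses. The `step_log_probs` callback and `toy_model` are hypothetical stand-ins for an NMT model's next-token distribution, not anything from the paper; the sketch only relies on the fact that log-probabilities are non-positive, so a prefix's score upper-bounds the score of every extension, and this is not necessarily the exact algorithm the authors use.

```python
import heapq
import math

def exact_top_k(step_log_probs, bos, eos, k, max_len):
    """Best-first exact top-k decoding over the full hypothesis space.

    step_log_probs(prefix) must return a dict: next token -> log P(token | prefix).
    Since every log-probability is <= 0, extending a prefix can only lower its
    score, so the first k finished hypotheses popped from the heap are exactly
    the k highest-scoring hypotheses (no search errors, unlike beam search).
    """
    heap = [(0.0, (bos,))]   # (negated log-prob, prefix); heapq is a min-heap
    finished = []
    while heap and len(finished) < k:
        neg_score, prefix = heapq.heappop(heap)
        if prefix[-1] == eos and len(prefix) > 1:
            finished.append((-neg_score, prefix))
            continue
        if len(prefix) >= max_len:
            continue         # discard over-length prefixes
        for token, lp in step_log_probs(prefix).items():
            heapq.heappush(heap, (neg_score - lp, prefix + (token,)))
    return finished          # [(log-prob, hypothesis), ...], best first


# Toy example: a fixed next-token distribution (1 = EOS, 2 = a content token).
def toy_model(prefix):
    return {1: math.log(0.6), 2: math.log(0.4)}

print(exact_top_k(toy_model, bos=0, eos=1, k=3, max_len=6))
```

Note that in the worst case this search is exponential in sequence length; practical exact decoders typically add pruning that discards any prefix already scoring below the current $k$-th best finished hypothesis, which shrinks the frontier without sacrificing exactness.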