Corpus Analysis Vector: A Data-driven Performance-oriented Corpus Quality Assessment For Neural Machine Translation
Abstract: Neural Machine Translation (NMT) employs neural networks to model the probability distribution of the parallel corpus, with advances in network architectures resulting in a substantial enhancement in translation quality.
The quality of the parallel corpus is also a significant factor in translation quality.
Despite the broad consensus on the positive correlation between corpus quality and translation quality, existing methods for assessing corpus quality fail to address the quantitative relationship between corpus quality and translation quality.
As a result, corpus quality assessment has to rely on subjective experience or on black-box language models, which obscure this relationship and are divorced from the mathematical modelling of NMT.
This introduces unavoidable bias and an unquantified impact on the NMT system.
To address these issues, this paper proposes the Corpus Analysis Vector (CAV), a data-driven framework that mathematically formalises corpus quality by converting text sequences into matrices under the modelling of NMT.
The paper employs the CAV framework to model the probability distribution of corpus and translation quality, mathematically formalising the relationship in the context of the translation accuracy prediction task.
Experiments on multiple benchmark datasets validate CAV: it predicts translation accuracy effectively by modelling the quantitative correlation between corpus quality and translation quality.
Case studies further illustrate the interpretability of CAV, which identifies quality-critical corpus features from a data-driven perspective.
The results also show that, beyond theoretical insights, CAV has practical utility in guiding corpus filtering, thereby enhancing NMT systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Automatic Evaluation; Data Influence; Evaluation Methodologies; Machine Translation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English-German (En-De), English-Chinese (En-Zh), German-Upper Sorbian (De-Hsb), and Russian-Chuvash (Ru-Chv).
Submission Number: 3354