Corpus Analysis Vector: A Data-driven Performance-oriented Corpus Quality Assessment For Neural Machine Translation

ACL ARR 2025 May Submission3354 Authors

19 May 2025 (modified: 29 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: Neural Machine Translation (NMT) uses neural networks to model the probability distribution of a parallel corpus, and advances in network architectures have substantially improved translation quality. The quality of the parallel corpus is also a significant factor in translation quality. Despite broad consensus that corpus quality and translation quality are positively correlated, existing methods for assessing corpus quality do not address the quantitative relationship between the two. As a result, corpus quality assessment has to rely on subjective experience or black-box language models, which obscure this relationship and are divorced from the mathematical modelling of NMT. This introduces unavoidable bias and an unestimated impact on the NMT system. To address these issues, this paper proposes the Corpus Analysis Vector (CAV), a data-driven framework that mathematically formalises corpus quality by converting text sequences into matrices under the modelling of NMT. The paper uses the CAV framework to model the probability distribution of corpus and translation quality, formalising their relationship in the context of the translation accuracy prediction task. Experiments on multiple benchmark datasets validate CAV: it is effective at translation accuracy prediction because it models the quantitative correlation between corpus quality and translation quality. Case studies then illustrate the interpretability of CAV in identifying quality-critical corpus features from a data-driven perspective. Beyond its theoretical insights, CAV also has practical utility in guiding corpus filtering, thereby enhancing NMT systems.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: Automatic Evaluation; Data Influence; Evaluation methodologies; Machine Translation

Contribution Types: Model analysis & interpretability, Data analysis

Languages Studied: English-German (En-De), English-Chinese (En-Zh), German-Upper Sorbian (De-Hsb), and Russian-Chuvash (Ru-Chv).

Submission Number: 3354
