Research on the Evaluation of Token Imbalance Degree of NMT Corpus


16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: As a kind of classifier, a neural machine translation (NMT) model is known to perform better when the tokens it is trained on are balanced. Studying the token distribution of an NMT corpus therefore offers guidance for improving both corpus quality and translation performance. Because existing research on token imbalance degree falls short in algorithm performance and in the range of segmentation levels covered, we propose the Dispersion of Token Distribution (DTD) algorithm and use it to evaluate corpora at three segmentation levels: character, subword, and word. Our experiments show that the algorithm improves accuracy, effectiveness, and robustness. We also find that the token imbalance degree of an NMT corpus varies greatly across segmentation levels: it is highest at the character level, lowest at the word level, and in between at the subword level. In addition, we identify regularities in the token imbalance degree across German (DE), English (EN), French (FR), and Russian (RU).
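The abstract does not define DTD, so the following is only a minimal sketch of what "measuring token imbalance at different segmentation levels" could look like. It assumes a stand-in measure, one minus the normalized Shannon entropy of the token frequencies, which is not the DTD algorithm. The function names and the toy corpus are hypothetical, and the subword level is omitted because it would require a trained segmenter such as BPE.

```python
import math
from collections import Counter


def imbalance(tokens):
    """Return 1 - normalized Shannon entropy of the token frequencies.

    0.0 means the token types are perfectly uniform; values closer to
    1.0 mean the distribution is more skewed. This is an illustrative
    stand-in measure, NOT the DTD algorithm from the paper.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0  # zero or one token type: no spread to measure
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


def char_tokens(sentences):
    """Character-level segmentation, ignoring whitespace."""
    return [ch for s in sentences for ch in s if not ch.isspace()]


def word_tokens(sentences):
    """Word-level segmentation by whitespace splitting."""
    return [w for s in sentences for w in s.split()]


# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
char_score = imbalance(char_tokens(corpus))
word_score = imbalance(word_tokens(corpus))
print(f"character-level imbalance: {char_score:.3f}")
print(f"word-level imbalance:      {word_score:.3f}")
```

A corpus this small says nothing about the trends reported in the abstract; the sketch only shows how one score per segmentation level could be computed and compared.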
