Abstract: As a kind of classifier, neural machine translation (NMT) is known to perform better when tokens are balanced during training. Studying the token distribution of an NMT corpus can therefore guide improvements to corpus quality and translation performance. Because existing research on the degree of token imbalance has deficiencies in algorithm performance and in the scope of word segmentation considered, we propose the Dispersion of Token Distribution (DTD) algorithm and use it to evaluate corpora at three segmentation levels: character, subword and word. Our experiments show that the algorithm offers improved accuracy, effectiveness and robustness. We also find that the token imbalance degree of an NMT corpus varies greatly across segmentation levels: it is highest at the character level, lowest at the word level, and intermediate at the subword level. In addition, we identify regularities in the token imbalance degree across German (DE), English (EN), French (FR) and Russian (RU).
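As a rough illustration of comparing token imbalance across segmentation levels, the sketch below scores a toy corpus at the character and word levels. The abstract does not define DTD, so this is not the DTD algorithm: it substitutes a generic measure (one minus normalized Shannon entropy), and the names `segment` and `imbalance` and the sample sentences are hypothetical. The subword level is omitted because it would need a trained subword model such as BPE.

```python
import math
from collections import Counter


def segment(sentence, level):
    """Split a sentence into tokens at the given (hypothetical) level."""
    if level == "character":
        # Treat every non-whitespace character as one token.
        return [ch for ch in sentence if not ch.isspace()]
    if level == "word":
        # Whitespace tokenization; real corpora would need a proper tokenizer.
        return sentence.split()
    raise ValueError(f"unsupported level: {level}")


def imbalance(tokens):
    """Stand-in imbalance score (NOT DTD): 1 - H / log(V).

    Returns 0.0 for a perfectly uniform distribution and approaches 1.0
    as the frequency mass concentrates on a few token types.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        # With fewer than two token types the normalization is undefined.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


if __name__ == "__main__":
    # Toy corpus, assumed for illustration only.
    corpus = ["the cat sat on the mat", "the dog sat on the log"]
    for level in ("character", "word"):
        tokens = [t for s in corpus for t in segment(s, level)]
        print(level, round(imbalance(tokens), 3))
```

A score from a stand-in measure like this is only comparable across corpora processed with the same segmentation; it says nothing about the specific values or rankings reported for DTD.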