Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation

Aizhan Imankulova, Takayuki Sato, Mamoru Komachi

2020 (modified: 15 Nov 2021)ACM Trans. Asian Low Resour. Lang. Inf. Process. 2020Readers: Everyone

Abstract: Large-scale parallel corpora are essential for training high-quality machine translation systems; however, such corpora are not freely available for many language translation pairs. Previously, training data has been augmented by pseudo-parallel corpora obtained by using machine translation models to translate monolingual corpora into the source language. However, in low-resource language pairs, in which only low-accurate machine translation systems can be used, translation quality degrades when a pseudo-parallel corpus is naively used. To improve machine translation performance with low-resource language pairs, we propose a method to effectively expand the training data via filtering the pseudo-parallel corpus using quality estimation based on sentence-level round-trip translation. For experiments with three language pairs that utilized small, medium, and large size parallel corpora, BLEU scores significantly improved for low-resource language pairs. Additionally, the effects of iterative bootstrapping on translation performance quality is investigated; resultingly, it is confirmed that bootstrapping can further improve the translation performance.

0 Replies