Regressing Word and Sentence Embeddings for Low-Resource Neural Machine Translation

Published: 01 Jan 2023 · Last Modified: 17 Jul 2025 · IEEE Trans. Artif. Intell. 2023 · CC BY-SA 4.0
Abstract: In recent years, neural machine translation (NMT) has achieved unprecedented performance in the automated translation of resource-rich languages. However, it has not yet achieved comparable performance over the many low-resource languages and specialized translation domains, mainly because it tends to overfit small training sets and consequently starves for additional data. For this reason, in this article, we propose a novel approach to regularize the training of NMT models and improve their performance over low-resource language pairs. In the proposed approach, the model is trained to copredict the target training sentences both as the usual categorical outputs (i.e., sequences of words) and as word and sentence embeddings. Because the word and sentence embeddings are pretrained over large corpora of monolingual data, they help the model generalize beyond the available translation training set. Extensive experiments over three low-resource language pairs have shown that the proposed approach outperforms strong state-of-the-art baseline models, with more marked improvements over the smaller training sets (e.g., up to $+6.57$ BLEU points in Basque–English translation). A further experiment on unsupervised NMT has also shown that the proposed approach improves the quality of machine translation even with no parallel data at all.
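To make the coprediction objective concrete, below is a minimal PyTorch sketch of how a cross-entropy loss over target tokens could be combined with regression losses toward pretrained word and sentence embeddings. All names, the cosine-distance choice, the projection layers, and the loss weights are illustrative assumptions, not the paper's actual specification.

```python
import torch.nn as nn
import torch.nn.functional as F


class CopredictionLoss(nn.Module):
    """Sketch of a joint objective: standard token-level cross-entropy plus
    regression of decoder states onto pretrained word/sentence embeddings.
    Hyperparameters and layer choices here are assumptions for illustration."""

    def __init__(self, hidden_dim, emb_dim, word_weight=0.1, sent_weight=0.1):
        super().__init__()
        # Project decoder hidden states into the pretrained embedding space.
        self.word_proj = nn.Linear(hidden_dim, emb_dim)
        self.sent_proj = nn.Linear(hidden_dim, emb_dim)
        self.word_weight = word_weight
        self.sent_weight = sent_weight

    def forward(self, logits, decoder_states, target_ids,
                target_word_embs, target_sent_emb, pad_id=0):
        # logits: (B, T, V), decoder_states: (B, T, H), target_ids: (B, T)
        # target_word_embs: (B, T, E) pretrained embeddings of gold tokens
        # target_sent_emb: (B, E) pretrained embedding of each target sentence

        # (1) Usual categorical output: token-level cross-entropy.
        ce = F.cross_entropy(logits.transpose(1, 2), target_ids,
                             ignore_index=pad_id)

        # (2) Word-level regression: cosine distance between projected
        # decoder states and the pretrained embeddings of the gold tokens.
        mask = (target_ids != pad_id).float()                  # (B, T)
        word_cos = F.cosine_similarity(self.word_proj(decoder_states),
                                       target_word_embs, dim=-1)
        word_loss = ((1.0 - word_cos) * mask).sum() / mask.sum()

        # (3) Sentence-level regression: mean-pool the decoder states and
        # regress toward the pretrained sentence embedding.
        pooled = ((decoder_states * mask.unsqueeze(-1)).sum(dim=1)
                  / mask.sum(dim=1, keepdim=True))             # (B, H)
        sent_cos = F.cosine_similarity(self.sent_proj(pooled),
                                       target_sent_emb, dim=-1)
        sent_loss = (1.0 - sent_cos).mean()

        return ce + self.word_weight * word_loss + self.sent_weight * sent_loss
```

Since the embedding targets come from models pretrained on large monolingual corpora, the two regression terms act as a regularizer: they constrain the decoder representations with signal that does not depend on the (small) parallel training set.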