A Content Word Augmentation Method for Low-Resource Neural Machine Translation

Published: 01 Jan 2023, Last Modified: 16 Jun 2025 · ICIC (4) 2023 · CC BY-SA 4.0
Abstract: Transformer-based neural machine translation (NMT) models have achieved state-of-the-art performance in the machine translation community. These models learn translation knowledge from a parallel corpus automatically through the attention mechanism. However, they fail to consider the semantic importance of words: content words play a more important role than function words in a sentence. This issue is particularly prominent in low-resource translation tasks, where insufficient parallel data results in poor translation quality. To alleviate this issue, a content word augmentation (CWA) method is proposed to improve the encoder for low-resource translation tasks. The main steps are as follows: first, the words in a sentence are classified into content words and function words by a content word selection algorithm; next, two fusion strategies incorporate the embeddings of content words into the NMT model to augment the encoder. Experimental results on several translation tasks show that the CWA method outperforms a strong baseline, significantly improving BLEU scores by 0.24 to 0.57 points.