A Bilingual Templates Data Augmentation Method for Low-Resource Neural Machine Translation

Published: 01 Jan 2024 · Last Modified: 16 Jun 2025 · ICIC (LNAI 3) 2024 · CC BY-SA 4.0
Abstract: The transformer-based neural machine translation (NMT) model has achieved remarkable success in the sequence-to-sequence NMT paradigm, exhibiting state-of-the-art performance. However, its reliance on abundant bilingual data resources poses a significant challenge, especially when dealing with scarce parallel sentence pairs. In such scenarios, the translation performance often deteriorates sharply. To alleviate this issue, this paper introduces a novel data augmentation (DA) approach for the NMT model. It leverages bilingual templates to augment the training set, thereby enhancing the generalization ability of the NMT model. Firstly, a template extraction algorithm is devised to generate sentence templates for both the source and target sentences in the original bilingual corpus. Subsequently, two data augmentation strategies are employed to expand the training corpus. By incorporating these augmented data into the training process, the NMT model is exposed to a broader range of linguistic phenomena, enabling it to better handle low-resource scenarios. The experimental results conducted on both simulated and real low-resource translation tasks reveal that the proposed DA approach significantly enhances translation performance. When compared to a robust baseline and several other data augmentation techniques, the proposed method consistently outperforms its counterparts, demonstrating its efficacy and versatility in addressing the challenges posed by limited parallel data.
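The abstract describes extracting bilingual sentence templates and then instantiating them to synthesize new parallel pairs. The paper's actual extraction algorithm is not given here, so the following is only a minimal illustrative sketch of the general idea: an aligned content-word pair is replaced by a shared slot to form a bilingual template, and the slot is then refilled with a different aligned word pair to produce an augmented training example. All function names, the slot token `[X]`, and the toy alignment are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only (not the paper's algorithm): build a bilingual
# template by replacing one aligned word pair with a slot, then refill the
# slot to synthesize a new parallel sentence pair.

def make_template(src_tokens, tgt_tokens, src_idx, tgt_idx):
    """Replace the aligned word pair at (src_idx, tgt_idx) with a shared slot [X]."""
    src_tpl = src_tokens[:src_idx] + ["[X]"] + src_tokens[src_idx + 1:]
    tgt_tpl = tgt_tokens[:tgt_idx] + ["[X]"] + tgt_tokens[tgt_idx + 1:]
    return src_tpl, tgt_tpl

def fill_template(src_tpl, tgt_tpl, src_word, tgt_word):
    """Instantiate a bilingual template with a new aligned word pair."""
    src = [src_word if t == "[X]" else t for t in src_tpl]
    tgt = [tgt_word if t == "[X]" else t for t in tgt_tpl]
    return " ".join(src), " ".join(tgt)

# Toy pair: "I like apples" / "ich mag Äpfel", with "apples"/"Äpfel" aligned.
src_tpl, tgt_tpl = make_template("I like apples".split(),
                                 "ich mag Äpfel".split(), 2, 2)
augmented = fill_template(src_tpl, tgt_tpl, "oranges", "Orangen")
# augmented is now a synthetic parallel pair not present in the original corpus.
```

In a real system the slot positions would come from word alignments (e.g. learned or dictionary-based) rather than hand-picked indices, and the replacement vocabulary would be filtered so the instantiated pair stays grammatical.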