A Clause-Based Data Augmentation Method for Low-Resource Neural Machine Translation

Published: 01 Jan 2025, Last Modified: 16 Jun 2025. Venue: IEEE Access 2025. License: CC BY-SA 4.0
Abstract: Transformer-based neural machine translation (NMT) systems achieve strong results when trained on high-resource bilingual corpora. However, their performance deteriorates significantly in low-resource settings because training data is scarce. To mitigate this issue, this paper proposes a novel clause-based data augmentation (DA) approach for NMT that expands the training set by reusing information already present in the original data. The method first develops a clause extraction algorithm and applies it to the target sentences. A target-to-source NMT model then translates the extracted clauses into the source language. To further enrich the training set, two DA strategies are employed. The approach is evaluated on four open low-resource translation tasks. Experimental results show that the method consistently outperforms the baseline model and several other DA approaches, indicating its potential to improve translation quality in low-resource scenarios.
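The pipeline described in the abstract (extract clauses from target sentences, translate them back into the source language, add the resulting pairs to the training set) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the clause extraction rules or the two DA strategies, so `extract_clauses` below is a naive punctuation-based stand-in, `back_translate` is a hypothetical callable representing the target-to-source NMT model, and `min_words` is an assumed filter.

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def extract_clauses(sentence: str, min_words: int = 3) -> List[str]:
    """Naive stand-in for clause extraction (assumption, not the paper's method):
    split on commas, semicolons, and colons, then drop very short fragments."""
    parts = re.split(r"[,;:]", sentence)
    if len(parts) < 2:
        # No internal punctuation: treat the sentence as a single clause
        # and return nothing, since it would only duplicate the original.
        return []
    clauses = [p.strip(" .!?") for p in parts]
    return [c for c in clauses if len(c.split()) >= min_words]


def augment(pairs: List[Pair], back_translate: Callable[[str], str]) -> List[Pair]:
    """Return the original pairs plus (back-translated clause, clause) pairs.

    `back_translate` is a hypothetical target-to-source translation function.
    """
    augmented = list(pairs)
    for _, target in pairs:
        for clause in extract_clauses(target):
            augmented.append((back_translate(clause), clause))
    return augmented


if __name__ == "__main__":
    # Toy stand-in for a target-to-source model: upper-casing marks "translated" text.
    toy_model = lambda text: text.upper()
    data = [("<source sentence>", "When the rain stopped, we went outside to play.")]
    for src, tgt in augment(data, toy_model):
        print(src, "|||", tgt)
```

In practice the toy model would be replaced by a trained target-to-source NMT system, and the punctuation split by a proper clause extractor, for example one based on a syntactic parser.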