A Semantic Uncertainty Sampling Strategy for Back-Translation in Low-Resource Neural Machine Translation

Published: 22 Jun 2025, Last Modified: 22 Jun 2025 · ACL-SRW 2025 Poster · CC BY 4.0
Keywords: Neural Machine Translation, Low-Resource, Semantic Uncertainty, Back-Translation
Abstract: Back-translation has proven effective in enhancing the performance of Neural Machine Translation (NMT); its core mechanism relies on synthesizing parallel corpora to strengthen model training. However, while traditional back-translation methods alleviate data scarcity in low-resource machine translation, their reliance on random sampling ignores the semantic quality of the monolingual data. As a result, the generated corpora contain many low-quality samples that contaminate model training, and mitigating this noise requires additional training iterations or larger models, significantly increasing computational cost. To address this challenge, this study proposes a Semantic Uncertainty Sampling strategy, which estimates the semantic complexity of unannotated monolingual data and prioritizes sentences with higher semantic uncertainty as training samples. Experiments were conducted on three typical low-resource agglutinative language pairs: Mongolian-Chinese, Uyghur-Chinese, and Korean-Chinese. Results demonstrate an average BLEU score improvement of +1.7 on the test sets across all three translation tasks, confirming the method's effectiveness in enhancing translation accuracy and fluency. This approach provides a novel pathway for the efficient utilization of unannotated data in low-resource language scenarios.
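The abstract does not specify how semantic uncertainty is computed, so the following is only a minimal sketch of the selection step, assuming mean per-token entropy of a model's next-token distributions as the uncertainty proxy; the function names and the toy probabilities are illustrative, not the authors' implementation. In practice the per-token distributions would come from a forward pass of an NMT or language model over each monolingual sentence.

    # Minimal sketch: rank monolingual sentences by an assumed
    # semantic-uncertainty score (mean per-token Shannon entropy)
    # and keep the top-k for back-translation, in place of the
    # random sampling used by vanilla back-translation.
    import math
    from typing import List, Sequence, Tuple

    def mean_token_entropy(token_probs: Sequence[Sequence[float]]) -> float:
        """Average entropy (nats) of the model's next-token distributions
        over a sentence; higher values indicate higher uncertainty."""
        entropies = [
            -sum(p * math.log(p) for p in dist if p > 0.0)
            for dist in token_probs
        ]
        return sum(entropies) / len(entropies)

    def select_for_back_translation(
        sentences: List[str],
        per_sentence_probs: List[Sequence[Sequence[float]]],
        k: int,
    ) -> List[Tuple[str, float]]:
        """Score each sentence, sort by descending uncertainty,
        and return the k most uncertain sentences with their scores."""
        scored = [
            (sent, mean_token_entropy(probs))
            for sent, probs in zip(sentences, per_sentence_probs)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

    if __name__ == "__main__":
        # Toy example: sentence B's flatter distributions give it
        # higher entropy, so it is selected over sentence A.
        sents = ["sentence A", "sentence B"]
        probs = [
            [[0.9, 0.1], [0.8, 0.2]],       # confident model -> low entropy
            [[0.5, 0.5], [0.4, 0.3, 0.3]],  # uncertain model -> high entropy
        ]
        print(select_for_back_translation(sents, probs, k=1))

The selected high-uncertainty sentences would then be back-translated by the reverse model to synthesize the parallel training corpus, as in standard back-translation pipelines.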
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 110