Neural Machine Translation for Agglutinative Languages via Data Rejuvenation

Published: 22 Jun 2025 · Last Modified: 22 Jun 2025 · ACL-SRW 2025 Poster · CC BY 4.0
Keywords: Neural Machine Translation, Agglutinative Languages, Data Rejuvenation, Inactive Samples, Low-Resource Translation
TL;DR: This paper proposes a data rejuvenation framework for agglutinative NMT. It identifies inactive samples via multi-dimensional metrics and reactivates them through target-side augmentation, achieving +2.1--3.4 BLEU gains on low-resource agglutinative translation tasks.
Abstract: In recent years, advances in Neural Machine Translation (NMT) have relied heavily on large-scale parallel corpora. Within the context of China's Belt and Road Initiative, there is increasing demand for higher-quality translation from agglutinative languages (e.g., Mongolian, Arabic) into Chinese. However, translation for agglutinative languages (which form words by concatenating morphemes with clear boundaries) faces significant challenges, including data sparsity, quality imbalance, and the proliferation of inactive samples, due to their morphological complexity and syntactic flexibility. This study presents a systematic analysis of the data distribution characteristics of agglutinative languages and proposes a dual-module framework that combines fine-grained inactive-sample identification with target-side rejuvenation. The framework first establishes a multi-dimensional evaluation system to accurately identify samples exhibiting low-frequency morphological interference or long-range word-order mismatches. A target-side rejuvenation mechanism then generates diversified, noise-resistant translations through iterative optimization of sample contribution weights. Experimental results on four low-resource agglutinative translation tasks demonstrate significant performance improvements (BLEU +2.1--3.4) across mainstream NMT architectures. Architecture-agnostic validation further confirms the framework's generalizability.
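The sketch below illustrates the two-stage pipeline the abstract describes: score each parallel sample with an identification model, treat the lowest-scoring fraction as inactive, and regenerate the target side of those samples with a rejuvenation model while down-weighting their contribution. It is a minimal illustration only; the callables `score_target` and `translate`, the quantile cutoff, and the fixed `down_weight` are hypothetical stand-ins, and the paper's multi-dimensional metrics are collapsed here into a single score.

```python
# Minimal sketch of a data-rejuvenation loop, assuming an external NMT toolkit
# provides the scoring and translation callables. All names, thresholds, and
# weights here are illustrative, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    src: str             # agglutinative-language source sentence
    tgt: str              # Chinese reference translation
    weight: float = 1.0   # per-sample contribution weight during training


def identify_inactive(
    samples: List[Sample],
    score_target: Callable[[str, str], float],  # hypothetical: P(tgt | src) from an identification model
    quantile: float = 0.1,
) -> Tuple[List[Sample], List[Sample]]:
    """Rank samples by the identification model's score of the reference target
    given the source; the lowest-scoring fraction is treated as inactive."""
    scored = sorted(samples, key=lambda s: score_target(s.src, s.tgt))
    cut = int(len(scored) * quantile)
    return scored[cut:], scored[:cut]  # (active, inactive)


def rejuvenate(
    inactive: List[Sample],
    translate: Callable[[str], str],  # hypothetical: rejuvenation model trained on active samples
    down_weight: float = 0.5,
) -> List[Sample]:
    """Replace the target side of each inactive sample with a fresh translation
    and assign it a reduced contribution weight for retraining."""
    return [Sample(s.src, translate(s.src), weight=down_weight) for s in inactive]
```

In use, the active subset plus the rejuvenated samples would form the new training set, and the per-sample weights could be re-estimated over several retraining rounds, corresponding to the iterative optimization of sample contribution weights mentioned in the abstract.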
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 107