Advancing Low-Resource Machine Translation: A Unified Data Selection and Scoring Optimization Framework
Abstract: Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited due to data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, tailored specifically to low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we use to fine-tune a compact multilingual model, mBART. We focus on the challenging and previously underexplored Chinese-Dutch translation task. Our experiments demonstrate that mBART fine-tuned on UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and DeepSeek-V3 in both translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale LLM inference. Our framework sets a new performance benchmark for Chinese-Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
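The abstract does not publish the scoring mechanism itself; the sketch below is a minimal illustration of the score-then-select idea under stated assumptions. The `Pair` and `select_pairs` names, the `length_ratio_score` placeholder (a crude proxy standing in for an LLM-based quality judge), and the threshold value are all illustrative, not the paper's implementation.

```python
# Minimal sketch of scoring-and-selection for corpus construction.
# All names, the scoring function, and the threshold are assumptions;
# the surviving pairs would then be used to fine-tune mBART.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pair:
    src: str  # Chinese source sentence
    tgt: str  # Dutch candidate translation

def select_pairs(pairs: List[Pair],
                 score_fn: Callable[[Pair], float],
                 threshold: float) -> List[Pair]:
    """Score every candidate pair and keep those clearing the quality
    threshold, best first; survivors form the fine-tuning corpus."""
    scored = sorted(((score_fn(p), p) for p in pairs), key=lambda sp: -sp[0])
    return [p for s, p in scored if s >= threshold]

def length_ratio_score(pair: Pair) -> float:
    """Crude stand-in for the (unspecified) LLM-based quality scorer:
    penalizes large source/target length mismatches. A real scorer would
    prompt an LLM to rate adequacy/fluency and parse a numeric grade."""
    shorter, longer = sorted((len(pair.src), len(pair.tgt)))
    return shorter / longer if longer else 0.0

corpus = [
    Pair("今天天气很好。", "Het weer is vandaag mooi."),
    # A likely misaligned, low-quality candidate pair:
    Pair("这是一个用于演示数据筛选流程的较长测试句子。", "Ja."),
]
# The threshold is arbitrary here; Chinese characters carry more content
# per character than Dutch letters, so raw length ratios skew low.
kept = select_pairs(corpus, length_ratio_score, threshold=0.2)
print(f"kept {len(kept)} of {len(corpus)} pairs")  # -> kept 1 of 2 pairs
```

In this sketch, scoring and selection are deliberately decoupled: swapping the length heuristic for an LLM judge changes only `score_fn`, which mirrors how a scalable scoring component could feed a fixed selection stage.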
External IDs: dblp:conf/icic/LuJLSXXZSSJ25