Dynamic Joint Batch Selection for Data-Efficient Machine Translation Fine-Tuning

ACL ARR 2025 February Submission 7215 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for robust and reliable translation systems. This paper presents a data selection methodology designed specifically for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to improve overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy that accounts for interdependencies among data points, optimizing training efficiency while maintaining a focus on data relevance. Experiments on English-to-Persian translation with an mBART model fine-tuned on the CCMatrix dataset show that our method achieves a fivefold improvement in data efficiency over an i.i.d. baseline. Because it requires fewer training data points, our approach also improves computational efficiency by 24% when cached embeddings are used, and it generalizes better, yielding superior translation quality compared to i.i.d. sampling.
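To make the mechanics concrete, here is a minimal sketch in PyTorch. It rests on illustrative assumptions rather than details confirmed by the paper: that the learnability score is the gap between per-example learner and reference losses, and that interdependencies among data points are handled by a greedy rule that penalizes similarity to already-selected examples using cached embeddings. The names learnability_scores, select_batch, and diversity_weight are hypothetical.

```python
import torch
import torch.nn.functional as F

def learnability_scores(learner_loss: torch.Tensor,
                        reference_loss: torch.Tensor) -> torch.Tensor:
    """Per-example learnability: high when the learner still struggles
    on a point that the pre-trained reference model finds easy."""
    return learner_loss - reference_loss

def select_batch(scores: torch.Tensor,
                 embeddings: torch.Tensor,
                 batch_size: int,
                 diversity_weight: float = 0.5) -> list:
    """Greedily pick a batch that trades off learnability against
    redundancy with already-selected points (cosine similarity on
    cached sentence embeddings)."""
    emb = F.normalize(embeddings, dim=-1)
    selected = []
    max_sim = torch.zeros_like(scores)
    for _ in range(batch_size):
        penalized = scores - diversity_weight * max_sim
        if selected:
            penalized[selected] = float("-inf")  # never re-pick a point
        idx = int(torch.argmax(penalized))
        selected.append(idx)
        # Track each candidate's similarity to its closest selected point.
        max_sim = torch.maximum(max_sim, emb @ emb[idx])
    return selected

# Toy usage: 1000 candidate sentence pairs with 64-dim cached embeddings.
learner = torch.rand(1000) * 5.0      # per-example learner losses
reference = torch.rand(1000) * 2.0    # per-example reference losses
cached = torch.randn(1000, 64)        # cached embeddings
batch = select_batch(learnability_scores(learner, reference), cached, 32)
```

In this sketch the diversity penalty is what makes the selection joint: each pick changes the effective score of the remaining candidates, unlike top-k selection on individual scores.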
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient Machine Translation
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings (efficiency), Data resources
Languages Studied: English, Persian
Submission Number: 7215