Abstract: Data quality and effective data selection are fundamental to improving the performance of machine translation models, serving as cornerstones of robust and reliable translation systems.
This paper presents a data selection methodology designed for fine-tuning machine translation systems, which pairs a learner model with a pre-trained reference model to improve overall training effectiveness.
By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process.
Furthermore, our method employs a batch selection strategy that accounts for interdependencies among data points, improving training efficiency while keeping the selected data relevant (a minimal sketch of this scoring-and-selection loop appears below).
Experiments on English-to-Persian translation, using an mBART model fine-tuned on the CCMatrix dataset, demonstrate that our method achieves a fivefold improvement in data efficiency over an i.i.d. sampling baseline.
Experimental results further indicate that our approach improves computational efficiency by 24% when cached embeddings are used, since it requires fewer training data points. It also generalizes better, yielding superior translation performance compared to i.i.d. selection.
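The abstract describes the method only at a high level. The sketch below illustrates one plausible reading, assuming the learnability score is the gap between the learner's and the reference model's per-example losses, and batches are chosen greedily with a redundancy penalty computed on cached embeddings. All function names, the label-masking convention, and the specific redundancy penalty are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def per_example_loss(model, input_ids, labels):
    """Token-level cross-entropy averaged per sequence (returns shape [B]).
    Assumes a HuggingFace-style seq2seq model (e.g., mBART) whose forward
    returns logits, with padded label positions set to -100."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, labels=labels).logits
    # (B, T) losses; ignore_index=-100 zeroes out padded positions
    loss = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    mask = (labels != -100).float()
    return (loss * mask).sum(dim=1) / mask.sum(dim=1)

def learnability_scores(learner, reference, input_ids, labels):
    """Examples the learner still finds hard but the pre-trained reference
    finds easy receive the highest scores."""
    return (per_example_loss(learner, input_ids, labels)
            - per_example_loss(reference, input_ids, labels))

def select_batch(scores, embeddings, k, redundancy_weight=0.5):
    """Greedy selection with a diversity penalty: after each pick, candidates
    similar to it (cosine similarity over cached embeddings) are discounted,
    so the chosen batch is not redundant with itself."""
    emb = F.normalize(embeddings, dim=1)   # (N, d) cached embeddings
    adjusted = scores.clone()
    chosen = []
    for _ in range(k):
        i = int(torch.argmax(adjusted))
        chosen.append(i)
        adjusted[i] = float("-inf")        # never pick the same point twice
        adjusted -= redundancy_weight * (emb @ emb[i])
    return chosen
```

In this reading, each training step would score a large candidate pool (or read scores and embeddings from a cache), call select_batch to pick the top-k mutually diverse points, and run the gradient update on those points only.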
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient Machine Translation
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings - efficiency, Data resources
Languages Studied: English, Persian
Submission Number: 7215