Sample Ordering and Selection Both Matter: A Case Study on the Impact of Sample Ordering in Active Learning for Translation
Abstract: Active learning (AL) is a technique for efficiently selecting subsets of data for annotation and fine-tuning, and it has been shown to outperform random sampling in classification tasks. However, it remains unclear why applying similar strategies does not consistently yield similar gains on natural language generation tasks. We hypothesize that previous methods underperform random sampling because they rarely consider interactions between the selected samples, and thus overlook training dynamics that may impact model performance. We find that in machine translation (MT), the ordering of the samples has a significant impact on performance, and show that fine-tuning the model on multiple shuffles of the data can allow AL to outperform random sampling in cases where it previously did not. We then present ways in which some shuffles of the training data learn the task of MT suboptimally, to motivate future AL strategies to explicitly account for training dynamics and mitigate these failure modes.
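The core intervention the abstract describes, fine-tuning on multiple independent shuffles of the same actively-selected subset, can be sketched as follows. This is an illustrative sketch only, not the paper's code; the toy `selected` pairs and the helper name `shuffles` are assumptions for demonstration.

```python
import random

# Illustrative sketch (not the paper's implementation): generate several
# independently shuffled orderings of one actively-selected subset.
# `selected` stands in for sentence pairs chosen by some AL strategy.
selected = [("hello", "hallo"), ("world", "Welt"), ("cat", "Katze"), ("dog", "Hund")]

def shuffles(data, num_shuffles, base_seed=0):
    """Yield independently shuffled orderings of the same subset."""
    for i in range(num_shuffles):
        rng = random.Random(base_seed + i)  # distinct seed per shuffle
        order = data[:]                     # copy so the original stays fixed
        rng.shuffle(order)
        yield order

# Each ordering would drive a separate fine-tuning run; comparing the
# resulting models exposes the ordering effect the abstract describes.
orderings = list(shuffles(selected, num_shuffles=3))
for run_id, order in enumerate(orderings):
    print(run_id, [src for src, _ in order])
```

Every ordering is a permutation of the same subset, so any performance difference between the runs is attributable to sample ordering rather than sample selection.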
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, data-efficient training
Contribution Types: Approaches to low-resource settings
Languages Studied: English, German, Afrikaans, Filipino, Haitian Creole
Submission Number: 2015