Sample Ordering and Selection Both Matter: A Case Study on the Impact of Sample Ordering in Active Learning for Translation
Abstract: Active learning (AL) is a technique for efficiently selecting subsets of data for annotation and fine-tuning, and it has been shown to outperform random sampling in classification tasks. However, it remains unclear why applying similar strategies does not consistently yield similar gains on natural language generation tasks. We hypothesize that previous methods underperform random sampling because they rarely consider interactions between the selected samples, and thus overlook training dynamics that may impact model performance. We find that in machine translation (MT), the ordering of the samples has a significant impact on performance, and show that fine-tuning the model on multiple shuffles of the data can allow AL to outperform random sampling in cases where it previously did not. We then present ways in which some shuffles of the training data learn the task of MT suboptimally, to motivate future AL strategies to explicitly account for training dynamics and mitigate these failure modes.
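The core intervention the abstract describes, fine-tuning on multiple independent shuffles of the same actively-selected subset, can be sketched as follows. This is an illustrative sketch only, not the paper's code; the toy `selected` pairs and the helper name `shuffles` are assumptions for demonstration.

```python
import random

# Illustrative sketch (not the paper's implementation): generate several
# independently shuffled orderings of one actively-selected subset.
# `selected` stands in for sentence pairs chosen by some AL strategy.
selected = [("hello", "hallo"), ("world", "Welt"), ("cat", "Katze"), ("dog", "Hund")]

def shuffles(data, num_shuffles, base_seed=0):
    """Yield independently shuffled orderings of the same subset."""
    for i in range(num_shuffles):
        rng = random.Random(base_seed + i)  # distinct seed per shuffle
        order = data[:]                     # copy so the original stays fixed
        rng.shuffle(order)
        yield order

# Each ordering would drive a separate fine-tuning run; comparing the
# resulting models exposes the ordering effect the abstract describes.
orderings = list(shuffles(selected, num_shuffles=3))
for run_id, order in enumerate(orderings):
    print(run_id, [src for src, _ in order])
```

Every ordering is a permutation of the same subset, so any performance difference between the runs is attributable to sample ordering rather than sample selection.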
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings, data-efficient training
Contribution Types: Approaches to low-resource settings
Languages Studied: English, German, Afrikaans, Filipino, Haitian Creole
Submission Number: 2015