Keywords: Data Selection, Adaptive Sampling
Abstract: Selecting high-quality training data can substantially reduce the computational
cost of instruction-tuning language models, as carefully curated datasets often
yield models that outperform those trained on much larger, noisier corpora. Most
existing automated data selection methods for instruction tuning, however, operate
in a single step and remain static throughout training. Inspired by ideas from
active learning, we study iterative data selection for instruction tuning, where the
training subset is updated over multiple iterations. To mitigate the computational
overhead typically associated with large language models, we further show that
a significantly smaller model can be used to guide data selection at negligible
cost while remaining competitive on downstream tasks. Through a case study on
LLaMA 3 8B (Grattafiori et al., 2024), we demonstrate that our adaptive selection
algorithm consistently matches or outperforms random selection across a diverse
suite of downstream benchmarks, while using fewer training examples.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 95