Abstract: Recent breakthroughs in deep learning have accelerated progress toward increasingly capable large language models (LLMs), even sparking discussions about the path to Artificial General Intelligence (AGI). Yet, current LLM training pipelines continue to depend on heuristics and human-driven empirical analysis to curate data. In practice, more sophisticated data selection methods often incur high costs, exhibit limited adaptability, or fail to consistently surpass simple random baselines across models and datasets. In this work, we propose Spaced Scheduled Training (Sst), a novel adaptive data selection strategy that prioritizes training examples based solely on per-example perplexity computed from the model's own evolving parameters. By obviating the need for external reference models, Sst tailors data selection to the model's unique characteristics, including its pre-training data composition, and eliminates biases commonly introduced by such external models. Extensive experiments on seven LLMs (0.5B to 32B parameters) in the instruction-finetuning (IFT) setting show that Sst consistently outperforms representative state-of-the-art selection approaches such as Deita and InsTag on the Open LLM Leaderboard. For instance, with Qwen2.5-32B and a 30k-example data budget, Sst achieves a 42.75% Open LLM Leaderboard score, exceeding a leading data-selection baseline (38.56%) and the full-100k dataset baseline (39.58%). We further present a theoretical framework to assess the computational overhead of model-based selection methods, showing that Sst remains efficient in practical scenarios, and propose strategies to mitigate the overhead in worst-case scenarios. Our findings underscore the potential of model-informed dynamic data selection, offering an efficient, adaptable, and cost-effective approach to data curation. We release our training code, trained models, and data mixes in our public repository.
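To make the core idea concrete, below is a minimal sketch, under assumptions, of self-perplexity-based selection as described in the abstract: each candidate example is scored by its perplexity under the model's current parameters, and the highest-perplexity examples are kept up to a fixed data budget. This is not the authors' exact Sst algorithm (which also involves scheduling); function names such as `select_by_self_perplexity`, the chosen checkpoint, and the "hardest-first" ranking are hypothetical illustration choices.

```python
# Minimal sketch (NOT the authors' exact Sst algorithm): score each candidate
# example by its perplexity under the *current* model weights, then keep the
# examples the model finds hardest, up to a fixed data budget.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def per_example_perplexity(model, tokenizer, text):
    """Perplexity of a single example under the model's own current parameters."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(DEVICE)
    # For a causal LM, passing labels = input_ids returns the mean token-level NLL.
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def select_by_self_perplexity(model, tokenizer, examples, budget):
    """Rank examples by self-perplexity and keep the `budget` highest-scoring ones."""
    model.eval()
    scored = [(per_example_perplexity(model, tokenizer, ex), ex) for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # hardest examples first
    return [ex for _, ex in scored[:budget]]

if __name__ == "__main__":
    name = "Qwen/Qwen2.5-0.5B"  # any causal LM; a small one keeps the sketch cheap
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).to(DEVICE)
    pool = ["Explain gradient clipping.", "Translate 'bonjour' to English.", "2 + 2 = ?"]
    print(select_by_self_perplexity(lm, tok, pool, budget=2))
```

Because the abstract emphasizes that scores come from the model's *evolving* parameters, such a scoring pass would presumably be repeated as training proceeds rather than run once up front; the single-pass version above is only the simplest illustration.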
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the AC's comments, we revised the text to clearly indicate that the work focuses on the IFT setting, both in the abstract and in the contributions list.
Supplementary Material: zip
Assigned Action Editor: ~Colin_Raffel1
Submission Number: 4384