Scheduling data improves fine-tuning data efficiency

20 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: language models, pre-training, mid-training, post-training, data order, rare data, data efficiency
TL;DR: Fine-tuning is the standard data schedule for leveraging target data, and we show that it benefits from either replaying generic data at the end of training or mixing target data into the start of training.
Abstract: To train a language model for a target domain with a limited amount of data (e.g., math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the target data. Since standard fine-tuning uses a data schedule that places all generic data before all target data, we ask how much we can improve performance on the target domain by adding generic data to the end of training or target data to the start. In a controlled pre-training environment, we first show that simply replaying generic data while fine-tuning, though typically used to reduce catastrophic forgetting of the generic domain, can surprisingly improve performance on the target domain. We then merge the two stages of pre-training and fine-tuning into a single learning rate schedule to establish a mid-training baseline that better leverages the target data. Under this merged learning rate schedule, we search over two-stage data schedules that additionally move target data earlier in training. After composing our three interventions, we estimate that standard fine-tuning would need up to 15.86x more data to match the target performance of our best data schedule. We test our findings at scale by showing how replay improves performance for larger models on downstream tasks, improving agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22424