Overtrained Language Models Are Harder to Fine-Tune

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens performs over 2% worse on multiple standard LLM benchmarks than its counterpart pre-trained on 2.3T tokens. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that accounts for the downstream adaptability of the model.
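The proposed mechanism, a growing sensitivity of the pre-trained weights to any modification, can be probed with a simple perturbation test: add Gaussian noise of a fixed scale to a checkpoint's parameters and measure how much the loss degrades, then compare checkpoints trained on different token budgets. The sketch below is only an illustrative proxy, not the authors' experimental protocol; the toy model, the noise scale `sigma`, and the evaluation batch are assumptions.

```python
import copy
import torch
import torch.nn as nn

def perturbation_sensitivity(model, loss_fn, batch, sigma=0.01, n_trials=5):
    """Average loss increase after adding isotropic Gaussian noise of scale
    sigma to every parameter; a rough proxy for how fragile a checkpoint is
    to weight modifications such as fine-tuning."""
    model.eval()
    with torch.no_grad():
        base_loss = loss_fn(model, batch).item()
        deltas = []
        for _ in range(n_trials):
            noisy = copy.deepcopy(model)          # leave the original intact
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))
            deltas.append(loss_fn(noisy, batch).item() - base_loss)
    return sum(deltas) / len(deltas)

# Toy usage with a stand-in regression model (hypothetical, not an LLM checkpoint).
if __name__ == "__main__":
    torch.manual_seed(0)
    x, y = torch.randn(64, 16), torch.randn(64, 1)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    mse = lambda m, b: nn.functional.mse_loss(m(b[0]), b[1])
    print(f"sensitivity at sigma=0.01: {perturbation_sensitivity(model, mse, (x, y)):.4f}")
```

Under the paper's claim, a checkpoint exhibiting catastrophic overtraining would show a larger loss increase under the same noise scale than a less-trained counterpart.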
Lay Summary: Recent improvements in artificial intelligence language models have largely been driven by increasing the amount of resources spent training them. Typically, this involves making models larger and giving them more data to learn from. Recently, however, researchers have begun training models far beyond the usual amount of data, aiming to keep models smaller and more efficient for later use. In our research, we identify an unexpected downside of this approach. Although training these models longer continuously improves their basic learning performance, we found that excessively trained models become surprisingly harder to adapt later for specific tasks. Specifically, models trained on very large amounts of data perform worse after adaptation, both on familiar tasks and new ones, compared to those trained more moderately. We also provide theoretical insights to explain this behavior in simplified settings. Overall, our findings suggest an unexpected trade-off: using more resources to train a model initially may actually make it less effective when it needs to be adapted later.
Primary Area: Deep Learning->Large Language Models
Keywords: pretraining, finetuning, catastrophic forgetting, transfer learning
Submission Number: 12319