Track: long paper (up to 4 pages)
Keywords: pre-training, fine-tuning, catastrophic forgetting, transfer learning
Abstract: Large language models are pre-trained with ever-increasing token budgets, operating under the largely unexamined premise that better pre-training performance translates into better downstream performance. In this work, we show that this widely held assumption is false: pre-training on an extremely large number of tokens eventually makes the model harder to fine-tune, leading to worse downstream performance. For instance, after instruction tuning or multimodal fine-tuning, OLMo-1B models pre-trained on 3T tokens underperform their 2.3T-token counterparts by over $2\%$ on standard LLM benchmarks. Controlled experiments and theoretical analysis show that this phenomenon, which we term catastrophic overtraining, is both fundamental and universal. Our results suggest that as token budgets continue to scale, models will experience increasingly severe fine-tuning degradation across a wider range of tasks. This calls for a critical reassessment of pre-training design that takes the entire model lifecycle into account.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 30