Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We study how the loss on pretraining and specialization datasets changes with model size, the number of available finetuning tokens, and the amount of pretraining data injected into the mix during finetuning.
Abstract: A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: \textit{(i)} if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and \textit{(ii)} the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as $1\%$ of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
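The injection strategy the abstract describes can be sketched as a sampling rule over two document pools. The helper below is a hypothetical illustration (the function name, arguments, and toy documents are not from the paper): each draw comes from the pretraining corpus with probability `inject_frac`, and from the target domain otherwise, so setting `inject_frac=0.01` reproduces the 1% mixture the authors highlight.

```python
import random

def mix_finetuning_batch(target_docs, pretrain_docs, inject_frac=0.01,
                         batch_size=64, rng=None):
    """Sample a finetuning batch that injects a small fraction of
    pretraining documents (illustrative sketch, not the paper's code)."""
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        # With probability inject_frac, draw from the pretraining pool.
        pool = pretrain_docs if rng.random() < inject_frac else target_docs
        batch.append(rng.choice(pool))
    return batch

# Toy usage: a large inject_frac makes the mixing ratio easy to see.
batch = mix_finetuning_batch(["t1", "t2"], ["p1", "p2"],
                             inject_frac=0.5, batch_size=1000,
                             rng=random.Random(42))
```

In practice the same idea applies at the token or shard level rather than per document; the point is only that the finetuning stream is a weighted mixture of the two sources.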
Lay Summary: A common approach with LLMs is to train on generic data that is abundant but not specific, and then fine-tune on specific data that is scarce. The finetuning stage may cause catastrophic forgetting of the generic data. We quantify this forgetting and show that it can be counterbalanced simply by mixing the scarce specific data with generic data during the finetuning stage.
Primary Area: Deep Learning->Foundation Models
Keywords: LLM, fine-tuning, scaling laws
Submission Number: 12087