Keywords: safety fine-tuning, catastrophic forgetting, loss landscape curvature, fine-tuning, Hessian analysis, data mixing, continual learning, Taylor expansion
TL;DR: Fine-tuning is fragile because it lands the model in sharply curved loss minima. Curvature, not just gradient conflict, drives forgetting, and mixing pre-training data into fine-tuning flattens these minima.
Abstract: Fine-tuning on narrow distributions often produces fragile changes
that are easily reversed by further training,
with implications for the durability of safety fine-tuning.
Mixing pre-training data into fine-tuning is a known mitigation,
but why varying the proportion of fine-tuning data
(which we term concentration)
modulates forgetting is poorly understood.
During a reversion phase
(subsequent training on pre-training data after fine-tuning),
we decompose the per-step change in fine-tune loss into
its first- and second-order Taylor terms.
We then track how each varies with concentration.
In experiments on LLMs (Pythia-70M),
we find that the second-order (curvature) term
grows in importance with concentration,
and that this sharpness lies specifically along the reversion update direction,
where it increases monotonically with concentration.
Curvature can therefore erase fine-tuned behaviour
even when fine-tune and pre-train gradients are not in conflict,
providing empirical support for recent theoretical accounts of curvature-driven forgetting.
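
The decomposition described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' code: it assumes a plain SGD reversion step, delta = -lr * g_pt, and uses hypothetical quadratic losses so that the second-order Taylor expansion is exact. The names `taylor_terms`, `A`, `a`, and `lr` are illustrative assumptions. The parameters start at the fine-tune minimum, so the fine-tune gradient is zero and the first-order (gradient conflict) term vanishes, yet the fine-tune loss still rises through the curvature term alone.

```python
import numpy as np

def taylor_terms(grad, hvp, theta, delta):
    """Return the first- and second-order Taylor terms of the loss change
    L(theta + delta) - L(theta) for a given update delta."""
    first = float(grad(theta) @ delta)               # g^T delta
    second = 0.5 * float(delta @ hvp(theta, delta))  # 0.5 * delta^T H delta
    return first, second

# Hypothetical fine-tune loss: L_ft(theta) = 0.5 (theta - a)^T A (theta - a).
# A is sharp along axis 0 and flat along axis 1.
A = np.diag([10.0, 0.1])
a = np.array([1.0, 1.0])

def loss_ft(theta):
    d = theta - a
    return 0.5 * float(d @ A @ d)

def grad_ft(theta):
    return A @ (theta - a)

def hvp_ft(theta, v):
    # Hessian-vector product; the Hessian of a quadratic is the constant A.
    return A @ v

# Hypothetical pre-train loss: L_pt(theta) = 0.5 ||theta||^2, so g_pt = theta.
def grad_pt(theta):
    return theta

theta = a.copy()             # start at the fine-tune minimum (g_ft = 0)
lr = 0.1                     # assumed SGD learning rate
delta = -lr * grad_pt(theta) # one reversion step on the pre-train loss

first, second = taylor_terms(grad_ft, hvp_ft, theta, delta)
actual = loss_ft(theta + delta) - loss_ft(theta)

print(f"first-order term:  {first:.4f}")
print(f"second-order term: {second:.4f}")
print(f"actual change:     {actual:.4f}")
```

For a real model the Hessian is never formed explicitly; the Hessian-vector product would instead come from automatic differentiation, and the expansion would hold only approximately.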
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23