Keywords: Pretraining, Optimizers, LLMs, Catastrophic Forgetting
TL;DR: Downstream stability in language models depends on both pre-training loss and loss-landscape sharpness, with increased sharpness leading to greater catastrophic forgetting.
Abstract: Standard optimizer choices for pre-training are designed to minimize pre-training loss. Yet pre-trained models are routinely subjected to further transformations, such as fine-tuning to acquire new capabilities or quantization for efficiency. In this work, we evaluate optimizers across model scales, token budgets, and datasets, and find that strategies that reduce sharpness, whether explicitly (Sharpness-Aware Minimization) or implicitly (large learning rates and Warmup–Stable–Decay schedules), yield better downstream performance even when they achieve comparable or worse pre-training loss. Combining these strategies yields a new pre-training recipe that substantially outperforms standard baselines with minimal compute overhead, delivering a better learning–forgetting frontier during fine-tuning and higher accuracy after quantization.
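To make the two sharpness-reducing ingredients named in the abstract concrete, the sketch below shows a Warmup–Stable–Decay learning-rate schedule and a two-pass Sharpness-Aware Minimization update in PyTorch. This is a minimal illustration under assumed defaults, not the paper's training code: the names wsd_lr and sam_step, the warmup and decay fractions, and the neighborhood radius rho=0.05 are choices made for the example.

import torch

def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.05, decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay schedule: linear warmup to peak_lr, a long
    stable phase at peak_lr, then linear decay to min_lr. The phase
    fractions are illustrative defaults, not the paper's settings."""
    warmup_steps = max(int(warmup_frac * total_steps), 1)
    decay_steps = max(int(decay_frac * total_steps), 1)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < stable_end:
        return peak_lr
    frac = min((step - stable_end) / decay_steps, 1.0)
    return peak_lr + frac * (min_lr - peak_lr)

@torch.no_grad()
def sam_step(model, loss_fn, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization update (Foret et al., 2021):
    step the base optimizer with the gradient evaluated at the
    adversarially perturbed point w + rho * g / ||g||, which penalizes
    sharp minima. loss_fn() must run a forward pass and return the loss."""
    # First pass: gradient at the current weights.
    with torch.enable_grad():
        loss = loss_fn()
        loss.backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    # Climb to the approximate local worst case within an L2 ball of radius rho.
    eps = []
    for p in params:
        e = p.grad * (rho / (grad_norm + 1e-12))
        p.add_(e)
        eps.append(e)
    base_optimizer.zero_grad()
    # Second pass: gradient at the perturbed weights.
    with torch.enable_grad():
        loss_fn().backward()
    # Restore the original weights, then step with the sharpness-aware gradient.
    for p, e in zip(params, eps):
        p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()

In a training loop, one would set each param group's lr from wsd_lr(step, total_steps, peak_lr) every step and call sam_step in place of the usual backward/step pair. Note that vanilla SAM's second forward-backward pass is the well-known cost of the explicit approach; the paper's combined recipe reports minimal compute overhead, so its exact use of SAM may differ from this plain two-pass sketch.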
Submission Number: 61