Keywords: continual learning; flat minima; sharpness-aware minimization
Abstract: Continual Learning (CL) aims to train neural networks on a dynamic task stream without forgetting previously learned knowledge.
With the rise of pre-training techniques, strong model generalization has become essential for stable continual learning. C-Flat is a powerful and general CL training regime that promotes generalization by seeking flatter optima across sequential tasks. However, it requires three additional gradient computations per step, resulting in up to 4× the computational cost. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer with a relaxed scheduling scheme that substantially reduces training cost. We show that the gradients toward first-order flatness contain direction-invariant components with respect to the proxy model at $\theta + \epsilon_1^*$, which allows us to skip redundant gradient computations in the perturbed ascent steps. Furthermore, a stage-wise step scheduler and adaptive triggering of the regularization mechanism enable dynamic control of C-Flat behavior throughout training. Experiments demonstrate that our optimizer achieves a speedup of at least 1$\times$ (up to 1.25$\times$) over C-Flat on most CL methods, while delivering better performance. Code will be released upon acceptance.
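To make the skipping idea concrete, the following is a minimal sketch of a SAM-style training step in which the ascent-direction gradient at the perturbed point is recomputed only periodically and otherwise reused, in the spirit of exploiting direction-invariant components. This is an illustration under assumed names (`rho`, `reuse_period`, `state`), not the authors' C-Flat Turbo implementation, which will be released upon acceptance.

```python
# Illustrative sketch only: SAM-like step that caches the ascent perturbation
# and reuses it for `reuse_period` iterations instead of recomputing it every step.
import torch

def sam_like_step(model, loss_fn, batch, optimizer, state, rho=0.05, reuse_period=2):
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Recompute the ascent direction only every `reuse_period` steps;
    # otherwise reuse the cached (assumed direction-invariant) perturbation.
    if state["step"] % reuse_period == 0 or state.get("eps") is None:
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        state["eps"] = [rho * g / (grad_norm + 1e-12) for g in grads]

    # Ascent: perturb the weights toward higher loss.
    with torch.no_grad():
        for p, e in zip(params, state["eps"]):
            p.add_(e)

    # Descent gradient is computed at the perturbed (proxy) point.
    loss_perturbed = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss_perturbed.backward()

    # Restore the original weights before applying the update.
    with torch.no_grad():
        for p, e in zip(params, state["eps"]):
            p.sub_(e)
    optimizer.step()

    state["step"] += 1
    return loss_perturbed.item()
```

In this sketch, skipping the ascent-gradient recomputation on reused steps removes one of the extra backward passes per iteration, which is the kind of saving the abstract attributes to the direction-invariant components.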
Supplementary Material: pdf
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 12146