TL;DR: We investigate how network size and feature learning regimes affect catastrophic forgetting in continual learning.
Abstract: Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting (CF) is far from complete.
In this work, we perform a systematic study of the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature by differentiating between *lazy* and *rich* training regimes through a variable parameterization of the architecture. We show that increasing model width is beneficial only when it reduces the amount of *feature learning*, yielding a lazier training regime. Using the framework of dynamical mean field theory, we then study the infinite-width dynamics of the model in the feature learning regime and characterize CF, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is beneficial only when tasks are highly similar. We identify a transition, modulated by task similarity, at which the model exits an effectively lazy regime with low forgetting and enters a rich regime with significant forgetting. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and *transfers across model scales*. This work provides a unified perspective on the role of scale and feature learning in continual learning.
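For concreteness, the lazy/rich distinction above can be interpolated with an output-scaling knob. Below is a minimal, self-contained sketch (not the submission's code) of a two-layer network in which a richness parameter `gamma` controls how much the hidden features move during training, with the learning rate rescaled by `gamma**2` so that the lazy limit (`gamma -> 0`) stays non-trivial. The symbol `gamma`, the NumPy implementation, and the exact scalings are illustrative assumptions in the spirit of DMFT/muP-style parameterizations, not the paper's actual setup.

```python
# Minimal sketch (not the submission's code): a two-layer ReLU network with a
# richness knob `gamma` interpolating between lazy (gamma -> 0, NTK-like) and
# rich (large gamma, strong feature learning) training. All scalings here are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, D, m = 512, 10, 64            # width, input dimension, number of samples
lr, steps, gamma = 0.5, 200, 1.0 # gamma: richness knob

# Toy regression data standing in for a single task.
X = rng.standard_normal((m, D))
y = np.sin(X[:, 0])

W = rng.standard_normal((N, D)) / np.sqrt(D)   # hidden-layer weights
a = rng.standard_normal(N) / np.sqrt(N)        # readout weights
W0 = W.copy()                                  # keep initial features for comparison

for _ in range(steps):
    h = np.maximum(X @ W.T, 0.0)               # (m, N) ReLU features
    f = h @ a / (gamma * np.sqrt(N))           # output carries a 1/gamma prefactor
    err = f - y                                # residual of a 0.5 * MSE loss
    grad_a = h.T @ err / (gamma * np.sqrt(N) * m)
    grad_W = ((err[:, None] * (h > 0.0)) * a).T @ X / (gamma * np.sqrt(N) * m)
    # Learning rate rescaled by gamma**2 so the network output moves at an
    # O(1) rate for every gamma; parameter (feature) movement scales with gamma.
    a -= lr * gamma**2 * grad_a
    W -= lr * gamma**2 * grad_W

# Crude proxy for feature learning: how far the hidden weights traveled.
print("relative feature movement:", np.linalg.norm(W - W0) / np.linalg.norm(W0))
```

Sweeping `gamma` (and the width `N`) inside a sequential multi-task loop of this kind is one way to probe the lazy-to-rich transition and the associated forgetting described in the abstract.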
Lay Summary: Neural networks are powerful learners when their tasks stay consistent, but they struggle when tasks change, often losing previously learned information—a problem known as catastrophic forgetting. Although catastrophic forgetting has been studied for a long time, it’s still unclear how best to prevent it. For instance, we don’t know if simply making networks larger helps them remember better, as research findings have been mixed.
We systematically studied how a network's size and its level of adaptability, called “feature learning”, affect its ability to retain knowledge as tasks change. Networks can be “lazy,” meaning they minimally adjust their internal settings, or “rich,” meaning they extensively modify themselves. Our experiments show that making networks larger helps only if they remain relatively “lazy.” Networks that extensively adapt their internal settings rapidly forget previously learned tasks, especially if the tasks are very different, and making these networks larger doesn’t prevent this forgetting.
Our results demonstrate that maintaining some degree of laziness is essential for neural networks to retain knowledge over time. Practically, this insight helps researchers and engineers design better neural networks for real-world applications where tasks frequently change, such as adaptive robotics or lifelong learning systems.
Primary Area: Deep Learning
Keywords: feature learning, continual learning, deep learning, forgetting, ntk, muP, lazy, dmft
Submission Number: 16178