Activation Functions and Normalization in Deep Continual Learning

08 Jan 2026 (modified: 28 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: Deep learning models often struggle to remain adaptable in continual learning scenarios, where the data distribution changes over time. Beyond the well-known challenge of catastrophic forgetting, these models also suffer from plasticity loss, characterized as a gradual decline in their ability to learn from future data. We study plasticity loss through the lens of the interaction between activations and normalization. In a large-scale empirical study, we evaluate 26 activation functions across three normalization strategies using ResNet-18 on the class-incremental CIFAR-100 benchmark. Our findings reveal that plasticity is not determined by any single design choice, but rather by the complex interplay between activation functions and normalization layers. We uncover a link between overfitting and plasticity loss, and show that simple yet effective training strategies, such as applying soft labels, learning rate warm-up, and excluding affine normalization parameters from L2 regularization, can significantly delay the onset of plasticity loss. Based on these findings, we offer additional recommendations for model design and training that keep networks inherently more performant and adaptable over long horizons without any active component.
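The abstract names three concrete mitigations. As a minimal sketch of how they might be wired up, assuming PyTorch/torchvision with a ResNet-18 backbone on CIFAR-100 (all hyperparameter values below are illustrative, not the paper's):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=100)  # class-incremental CIFAR-100 backbone

# 1) Exclude affine normalization parameters (and biases) from L2
#    regularization by placing them in a weight-decay-free parameter group.
decay, no_decay = [], []
norm_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.GroupNorm, nn.LayerNorm)
for module in model.modules():
    for name, param in module.named_parameters(recurse=False):
        if isinstance(module, norm_types) or name == "bias":
            no_decay.append(param)
        else:
            decay.append(param)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 5e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
    momentum=0.9,
)

# 2) Learning rate warm-up: ramp linearly from 10% to 100% of the base LR
#    over the first 500 optimizer steps (call warmup.step() each iteration).
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=500
)

# 3) Soft labels via label smoothing in the cross-entropy loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```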
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Elahe_Arani1
Submission Number: 6917