Transfer is All You Need: Revisiting the Stability–Plasticity Dilemma through Backward and Forward Transfer in PLMs

ICLR 2026 Conference Submission 18684 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: PLM, LLM, Continual Learning, Incremental Learning, Lifelong Learning, Stability-Plasticity Dilemma
TL;DR: Forgetting in IL mainly comes from the classifier, not the backbone. We propose Just LM-Head Tuning (JLT), which achieves SOTA performance by simply re-training the LM Head.
Abstract: Incremental Learning (IL) has long been an important research area in neural networks. Because IL requires retaining prior knowledge while learning tasks sequentially, many studies have focused primarily on 'Memory Stability' to address catastrophic forgetting, while paying less attention to 'Learning Plasticity'. This perspective has recently been challenged: recent studies have demonstrated that the backbone exhibits sufficiently strong anti-forgetting capabilities and that the classifier (LM Head) is the primary source of forgetting. Moreover, as research on Learning Plasticity has expanded, conflicting findings have emerged regarding the relationship between forgetting and forward transfer. To address this issue, we propose a method to evaluate the forgetting and forward-transfer ability of the backbone itself and compare it with the corresponding evaluation at the classifier. To this end, we re-formulate the well-known metrics BWT (Backward Transfer) and FWT (Forward Transfer) and analyze the correlation between them. We find that BWT and FWT yield markedly different measurements at the classifier, a probing classifier, and the backbone, and that this discrepancy explains the conflicting findings of previous studies. In addition, we observe that the considerable capability of the backbone is not effectively transferred to the classifier (LM Head). To address this, we propose 'Just LM-Head Tuning (JLT)', a simple yet highly effective approach that leverages the backbone trained through the IL process to optimize the classifier (LM Head). JLT is compatible with all existing IL methods and achieves state-of-the-art (SOTA) performance while allowing the backbone to remain unfrozen and continue acquiring knowledge. We demonstrate this effectiveness on five representative benchmarks, not only with older discriminative backbones such as BERT but also with very recent generative backbones such as LLaMA3.2 and Qwen3.
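For concreteness, a minimal sketch of the standard BWT/FWT definitions (the Lopez-Paz & Ranzato, 2017 formulation) that the abstract builds on; the paper's re-formulated classifier-, probing-, and backbone-level variants are not reproduced here. The accuracy matrix `R` and random-initialization baseline `b` are assumptions about the evaluation setup, not the authors' code.

```python
import numpy as np

def bwt_fwt(R: np.ndarray, b: np.ndarray):
    """Standard backward/forward transfer (Lopez-Paz & Ranzato, 2017).

    R[i, j]: test accuracy on task j after training on tasks 0..i (T x T).
    b[j]: accuracy of a randomly initialized model on task j (FWT reference).
    """
    T = R.shape[0]
    # BWT: how much training on later tasks changed earlier-task accuracy
    # (negative values indicate forgetting).
    bwt = np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])
    # FWT: zero-shot accuracy on each task before it is trained,
    # relative to the random baseline.
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])
    return bwt, fwt
```

Likewise, a minimal sketch of what 'Just LM-Head Tuning' could look like in PyTorch, assuming a Hugging Face-style model with an `lm_head` attribute and batches whose forward pass returns a `.loss`; an illustration of the idea (retrain only the LM Head on top of the IL-trained backbone), not the authors' implementation.

```python
import torch

def just_lm_head_tuning(model, loader, epochs: int = 1, lr: float = 1e-4):
    # Freeze the IL-trained backbone; only the LM head stays trainable.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.lm_head.parameters():
        p.requires_grad = True

    opt = torch.optim.AdamW(model.lm_head.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            out = model(**batch)  # assumes the forward pass returns .loss
            out.loss.backward()
            opt.step()
            opt.zero_grad()
    return model
```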
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 18684