HC2D: Attention-Based Two-Phase Distillation for Transformer Continual Learning

ICLR 2026 Conference Submission 21101 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Continual Learning, Vision Transformer, Attention Distillation, Pattern Separation, Cortical Consolidation, Introspection, Asymmetric Distillation
TL;DR: This paper introduces HC2D, a two-phase distillation framework inspired by the hippocampal–cortical memory mechanism.
Abstract: Catastrophic forgetting refers to the marked degradation in performance on previously learned tasks after training on new ones, and continual learning aims to mitigate this problem. Many existing methods preserve past knowledge by constraining updates to lie within previously learned representations. However, such locality can hinder the discovery of genuinely novel discriminative cues, thereby intensifying the stability–plasticity dilemma. Inspired by hippocampal–cortical memory theory and the principle of introspection, we propose a novel training framework: Hippocampal-to-Cortical Two-Phase Distillation (HC2D). In Phase I (Pattern Separation), an introspective negative attention regularizer suppresses reuse of the original model's core attention peaks while preserving global directional consistency, guiding the student to discover novel, complementary discriminative cues; the resulting model serves as the hippocampal teacher. In Phase II (Cortical Consolidation), we selectively consolidate the hippocampal teacher's most salient attention patterns into the cortical backbone via asymmetric distillation, without compromising the backbone's primary attention distribution. HC2D leverages Vision Transformer–based attention distillation to implement an "aggressive exploration first, robust consolidation later" strategy, with virtually no additional inference overhead. Experimental results show that HC2D consistently mitigates catastrophic forgetting and becomes increasingly effective over longer task sequences, offering a biologically inspired and computationally efficient solution for transformer-based continual learning.
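To make the two-phase idea concrete, below is a minimal, hypothetical sketch of how the two losses described in the abstract could be written for ViT attention maps. The abstract does not give the formulation, so the top-k peak masking, the cosine consistency term, the salient-entry MSE match, and all function names and fractions here are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of HC2D-style phase losses for ViT attention maps.
# All function names, the top-k fractions, and the specific loss terms are
# assumptions made for illustration; they are not taken from the paper.
import torch
import torch.nn.functional as F


def phase1_negative_attention_loss(student_attn, teacher_attn, peak_frac=0.05):
    """Phase I (Pattern Separation), as interpreted here: penalize the student
    for re-using the frozen teacher's strongest attention peaks, while a cosine
    term keeps the overall attention direction consistent with the teacher.

    student_attn, teacher_attn: (batch, heads, tokens, tokens) attention maps.
    """
    b, h, n, _ = teacher_attn.shape
    flat_t = teacher_attn.reshape(b, h, -1)
    flat_s = student_attn.reshape(b, h, -1)

    # Mask of the teacher's top-k "core peak" positions.
    k = max(1, int(peak_frac * flat_t.shape[-1]))
    peak_idx = flat_t.topk(k, dim=-1).indices
    peak_mask = torch.zeros_like(flat_t).scatter_(-1, peak_idx, 1.0)

    # Suppress student attention mass on those peaks (negative regularizer) ...
    peak_reuse = (flat_s * peak_mask).sum(-1).mean()
    # ... while preserving global directional consistency with the teacher.
    direction = 1.0 - F.cosine_similarity(flat_s, flat_t, dim=-1).mean()
    return peak_reuse + direction


def phase2_asymmetric_distill_loss(cortical_attn, hippo_attn, salient_frac=0.05):
    """Phase II (Cortical Consolidation), as interpreted here: distill only the
    hippocampal teacher's most salient attention entries into the cortical
    backbone, leaving the rest of the backbone's attention distribution free.
    """
    b, h, n, _ = hippo_attn.shape
    flat_hippo = hippo_attn.reshape(b, h, -1)
    flat_cort = cortical_attn.reshape(b, h, -1)

    # Select the teacher's most salient attention entries.
    k = max(1, int(salient_frac * flat_hippo.shape[-1]))
    sal_idx = flat_hippo.topk(k, dim=-1).indices
    sal_mask = torch.zeros_like(flat_hippo).scatter_(-1, sal_idx, 1.0).bool()

    # One-directional (asymmetric) match on the salient entries only; the
    # teacher is detached so gradients flow into the cortical backbone alone.
    return F.mse_loss(flat_cort[sal_mask], flat_hippo[sal_mask].detach())
```

In this reading, Phase I pushes the student away from the teacher's peak locations (encouraging complementary cues) without letting its overall attention drift, and Phase II copies back only the newly found salient peaks, which is one plausible way to realize "asymmetric distillation without compromising the primary attention distribution."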
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 21101