Orthogonal Updates Are Optimal for Continual Learning

ICLR 2026 Conference Submission 719 Authors

02 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Continual Learning, Weight Perturbation, Neural Tangent Kernel
TL;DR: Orthogonal Updates Are Optimal for Continual Learning.
Abstract: Catastrophic forgetting arises when updates for new tasks perturb predictions on earlier ones. We pose continual learning as \emph{interference minimization} and show that, under a first-order (linearized) model of training dynamics, \emph{orthogonal task updates across layers} are both \emph{necessary for zero interference and sufficient to achieve the minimum interference bound}. From a function perspective, the Neural Tangent Kernel (NTK) view identifies interference-free learning with a \emph{zero cross-kernel} block. We prove that, under a mild spectral-concentration assumption on cross-layer Jacobians, this functional condition is approximately realized by \emph{layerwise Frobenius orthogonality}, yielding a unified parameter–gradient–function principle. Guided by this principle, we design a basis-agnostic \emph{orthogonal decomposition} where tasks share an orthogonal basis but use disjoint sparse supports. This construction guarantees exact non-interference at finite width (in the first-order sense), provides an explicit sparsity–error trade-off, and yields high-probability quadratic capacity $O(d^2/k)$ with constant per-task training cost, up to precomputation of patterns. Empirically, on class-incremental benchmarks our method attains competitive accuracy and strong robustness to forgetting, and matches the predicted capacity/efficiency behavior. Overall, we identify orthogonality as the locally optimal first-order structure for continual learning and provide a simple, constructive framework to enforce it in practice.
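The core construction described in the abstract, tasks sharing an orthogonal basis while writing their updates on disjoint sparse supports, can be illustrated with a minimal sketch. This is not the authors' code; the basis choice (QR of a Gaussian matrix), the contiguous supports, and the layer width and support size below are illustrative assumptions, used only to show why disjoint supports in a shared orthogonal basis give exactly Frobenius-orthogonal (and hence, to first order, non-interfering) per-layer updates.

```python
# Minimal sketch (not the paper's implementation): tasks share one orthogonal
# basis Q for a layer but confine their updates to disjoint column supports,
# which makes the per-layer updates Frobenius-orthogonal by construction.
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8  # hypothetical layer width d and per-task support size k

# Shared orthogonal basis (illustrative choice: QR of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def task_update(task_id, coeffs):
    """Build a rank-k weight update confined to task-specific basis columns."""
    support = np.arange(task_id * k, (task_id + 1) * k)  # disjoint supports
    B = Q[:, support]                                     # d x k sub-basis
    return B @ coeffs @ B.T                               # d x d update

dW0 = task_update(0, rng.standard_normal((k, k)))
dW1 = task_update(1, rng.standard_normal((k, k)))

# Frobenius inner product <dW0, dW1>_F vanishes because Q's columns are
# orthonormal and the two supports share no columns.
print(np.abs(np.sum(dW0 * dW1)))  # numerically zero (~1e-14)
```

With strictly disjoint contiguous supports as above, a layer accommodates d/k tasks exactly; the abstract's high-probability $O(d^2/k)$ capacity presumably rests on the paper's sparse-pattern construction rather than this simplified partition, so the sketch should be read only as a demonstration of the orthogonality mechanism.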
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 719