From Continual Learning to SGD and Back: Better Rates for Continual Linear Models

Published: 12 Jun 2025, Last Modified: 03 Aug 2025, CoLLAs 2025 - Workshop Track, CC BY 4.0
Keywords: continual learning, random orderings, convergence rates, sgd
TL;DR: We prove a reduction from continual learning to stepwise-optimal SGD and use a novel last-iterate SGD analysis to yield tighter forgetting rates for continual linear models.
Abstract: We theoretically study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze the forgetting, i.e., the loss on previously seen tasks, after $k$ iterations. For continual linear models, we prove that fitting a task is equivalent to a *single* stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup, which we then leverage to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish *universal* forgetting rates, whereas existing rates depend on the problem dimensionality or complexity. Specifically, in continual regression with replacement, we improve the best existing rate from $\mathcal{O}((d-\bar{r})/k)$ to $\mathcal{O}(\min(1/\sqrt[4]{k},\, \sqrt{d-\bar{r}}/k,\, \sqrt{T\bar{r}}/k))$, where $d$ is the dimensionality and $\bar{r}$ the average task rank. Furthermore, we establish the first rate for random task orderings *without* replacement. The obtained rate of $\mathcal{O}(\min(1/\sqrt[4]{T},\, (d-r)/T))$ proves for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences. Finally, we prove a matching $\mathcal{O}(1/\sqrt[4]{k})$ forgetting rate for continual linear *classification* on separable data. Our universal rates apply to broader projection methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and one-pass orderings.
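
To make the setup concrete, below is a minimal NumPy sketch (not the authors' code) of continual linear regression on jointly realizable tasks under a random ordering with replacement. Fitting each task with the minimum-norm update projects the current iterate onto that task's solution set, which is the block-Kaczmarz step the abstract alludes to; forgetting is measured here as the average squared loss over tasks seen so far. All problem sizes, and this particular forgetting formula, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, m = 50, 20, 5                      # dimension, number of tasks, samples per task
w_star = rng.standard_normal(d)          # shared solution => joint realizability
tasks = []
for _ in range(T):
    X = rng.standard_normal((m, d))
    tasks.append((X, X @ w_star))        # y_t = X_t w_star, so every task is realizable

def fit_task(w, X, y):
    """Minimum-norm fit of one task: project w onto {v : X v = y} (block-Kaczmarz step)."""
    return w - np.linalg.pinv(X) @ (X @ w - y)

def forgetting(w, tasks, seen_idx):
    """Average squared loss of the current iterate on previously seen tasks."""
    return np.mean([np.mean((tasks[t][0] @ w - tasks[t][1]) ** 2) for t in seen_idx])

w = np.zeros(d)
k = 200                                  # number of iterations
seen = set()
for i in range(k):
    t = int(rng.integers(T))             # random ordering *with* replacement
    X, y = tasks[t]
    w = fit_task(w, X, y)
    seen.add(t)
    if (i + 1) % 50 == 0:
        print(f"iter {i + 1:4d}  forgetting = {forgetting(w, tasks, seen):.3e}")
```

A one-pass variant of the same sketch (iterate over a random permutation of the $T$ tasks instead of sampling with replacement) corresponds to the without-replacement setting analyzed in the paper.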
Submission Number: 5