%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%

\usepackage{color}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{comment}
\usepackage{pifont}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{lipsum}
\usepackage{amsthm}
\usepackage{float}
\usepackage{graphics}
\usepackage[skip=0pt]{caption}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{multirow}
\usepackage{makecell}
\usepackage{balance}
\usepackage{subcaption}
\usepackage[inline]{enumitem}
\usepackage{mathtools,lipsum}
\usepackage{amsmath}
\usepackage{bm}
\usepackage{wrapfig}
\usepackage{amssymb}
\usepackage{mathrsfs}
\usepackage{soul}



\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\usepackage{cuted}
\setlength\stripsep{3pt plus 1pt minus 1pt}


\allowdisplaybreaks[4]
\theoremstyle{definition}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\input{symbols_commands}

\newcommand{\alg}{$\mathsf{Prometheus}$-$\mathsf{SG}$~}
\newcommand{\algns}{$\mathsf{Prometheus}$-$\mathsf{SG}$}
\newcommand{\algvr}{$\mathsf{Prometheus}$~}
\newcommand{\algvrns}{$\mathsf{Prometheus}$}
\newcommand{\defeq}{\overset{\text{\tiny def}}{=}}

\newcommand{\kevin}[1]{\textcolor[RGB]{255,0,0}{[Kevin: #1]}}
\newcommand{\jin}[1]{\textbf{\color{purple} Jin: [#1]}}

\newcommand{\doublehat}[1]{\hat{\hat{#1}}}


%\setlength{\tolerance}{1000}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Divide and Orthogonalize: Efficient Continual Learning \\ with Local Model Space Projection}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<imjshang@amazon.com>?Subject=Your UAI 2025 paper}{Jin Shang}}
\author[1]{\href{mailto:<simengsh@amazon.com>?Subject=Your UAI 2025 paper}{Simone Shao}}
\author[1]{\href{mailto:<tongtn@amazon.com>?Subject=Your UAI 2025 paper}{Tian Tong}}
\author[1]{\href{mailto:<fnam@amazon.com>?Subject=Your UAI 2025 paper}{Fan Yang}}
\author[1]{\href{mailto:<yetichen@amazon.com>?Subject=Your UAI 2025 paper}{Yetian Chen}}
\author[1]{\href{mailto:<jaoyan@amazon.com>?Subject=Your UAI 2025 paper}{Yang Jiao}}
\author[2,1]{\href{mailto:<liu@ece.osu.edu>?Subject=Your UAI 2025 paper}{Jia Liu}}
\author[1]{\href{mailto:<yanngao@amazon.com>?Subject=Your UAI 2025 paper}{Yan Gao}}
% Add affiliations after the authors
\affil[1]{%
    Amazon.com\\
    Seattle, WA, USA
}
\affil[2]{%
    The Ohio State University\\
    Columbus, OH, USA\\
}
\affil[1]{\texttt{\{imjshang, simengsh, tongtn, fnam, yetichen, jaoyan, yanngao\}@amazon.com}}
\affil[2]{\texttt{\ liu@ece.osu.edu}}
  
\begin{document}
\maketitle

\begin{abstract}
Continual learning (CL) has gained increasing interest in recent years due to the need for models that can continuously learn new tasks while retaining knowledge from previous ones. However, existing CL methods often require either computationally expensive layer-wise gradient projections or large-scale storage of past task data, making them impractical for resource-constrained scenarios.
To address these challenges, we propose a local model space projection (LMSP)-based continual learning framework that significantly reduces computational complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ while preserving both forward and backward knowledge transfer with minimal performance trade-offs. We establish a theoretical analysis of the error and convergence properties of LMSP compared to conventional global approaches.
Extensive experiments on multiple public datasets demonstrate that our method achieves competitive performance while offering substantial efficiency gains, making it a promising solution for scalable continual learning.
\end{abstract}

\section{Introduction}\label{intro}
Humans have the unique ability to continuously learn new tasks throughout their lives without forgetting their previously learned knowledge.
This remarkable capability has recently inspired efforts in the machine learning community to develop similar capabilities for deep neural network (DNN)-based machine learning models, a problem termed continual learning (CL).
However, one of the most significant challenges in CL is that DNN models are known to suffer from ``catastrophic forgetting'', i.e., the performance on previously learned tasks degrades after the model learns new tasks.
In the literature, numerous strategies have been proposed to address the challenge of catastrophic forgetting in CL.
Existing forgetting mitigation approaches can be classified into three major categories:
i) experience replay,
ii) regularization,
and iii) orthogonal projection (see Section~\ref{Section: related} for more in-depth discussions).
Generally speaking, experience-replay-based methods constrain the gradient directions by replaying the data of old tasks while learning new tasks, in the form of either real data or synthetic data from generative models, while regularization-based methods penalize modifications of the most important weights of old tasks through model regularization.
Because information from old and new tasks (model or data) is mixed, some performance decay on the old tasks is inevitable under the experience-replay and regularization-based approaches.
In contrast, orthogonal-projection-based methods update the model in the direction {\em orthogonal} to the subspace of old tasks, which has demonstrated superior performance compared to other approaches \citep{saha2021gradient} -- a highly desirable feature for CL in practice.

We note, however, that due to a number of technical challenges, developing practical orthogonal-projection-based CL approaches remains highly non-trivial.
The first major challenge of orthogonal-projection-based CL approaches stems from the projection operation, which typically relies on singular-value decomposition (SVD)~\citep{lin2022beyond,lin2022trgp}.
These methods perform \textbf{layer-wise} SVD after the training of each task.
It is well known that SVD has $\mathcal{O}(n^3)$ complexity for an $n$-dimensional model, which grows rapidly as $n$ increases.
With the ever-increasing widths and depths of large deep learning models, computing such layer-wise SVDs upon the completion of each new task's training becomes increasingly difficult.

Another key challenge of the standard orthogonal-projection-based CL approaches lies in the inherent difficulty of facilitating {\em forward and backward knowledge transfer} (i.e., the learning of new tasks benefits from the knowledge acquired from old tasks, and the knowledge learned from new tasks further improves the performance on old tasks) when a new task has strong similarity with some old tasks. Moreover, integrating computational efficiency into an orthogonal-projection-based CL framework, while preserving performance and enabling both forward and backward knowledge transfer, remains a significant challenge.
This motivates us to pursue a new efficient orthogonal-projection-based CL design.% to fill this gap in the CL literature. 

In this paper, we propose an efficient local low-rank orthogonal-projection-based CL method based on local model space projection (LMSP), which not only significantly reduces the complexity of SVD basis computation, but also enables forward and backward knowledge transfers without sacrificing too much performance.
The main results and contributions of this paper are as follows:

%\vspace{-.1in}
\begin{list}{\labelitemi}{\leftmargin=1em}
%\itemsep -1pt
\item Our proposed LMSP-based orthogonal projection approach follows a ``divide and orthogonalize'' principle: we approximate the per-layer representation matrix by a set of local low-rank matrices defined by a set of anchor points, which significantly reduces the computational complexity of performing projections from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ with only a minor projection error.

\item We theoretically show that our proposed LMSP-based orthogonal projection approach achieves an $\mathcal{O}(1/K)$ convergence rate under both convex and non-convex settings, where $K$ is the number of iterations.
Moreover, we further prove that the proposed LMSP-based orthogonal projection approach supports forward and backward knowledge transfer.
In addition, by characterizing upper and lower bounds on the approximation error, we
analyze the approximation accuracy of the LMSP-based orthogonal projection approach relative to the original full-rank approach.

\item Through extensive experiments, we demonstrate that our proposed LMSP-based orthogonal projection approach achieves performance comparable to state-of-the-art baselines on four public datasets in terms of training accuracy and forward/backward knowledge transfer. Moreover, our approach significantly enhances efficiency while maintaining competitive performance, even compared to the original full-rank approach. We further conduct ablation studies to validate the effectiveness and efficiency of each key component in our LMSP-based design.

\end{list}

\section{Related Work}
\label{Section: related}
In this section, we provide an overview of the continual learning and local low-rank model approximation literature to further motivate this research and put our work in comparative perspective.

%\smallskip
{\bf 1) Continual Learning: A Primer.}
Continual learning (CL), also known as lifelong learning or incremental learning, is an emerging area in machine learning research that has recently attracted significant interest.
CL addresses the challenge of enabling a machine learning model to accumulate knowledge and adapt to new tasks that arrive sequentially over time \citep{chen2018lifelong}. 
A key goal of CL is to avoid ``catastrophic forgetting''  \citep{mccloskey1989catastrophic,abraham2005memory}, i.e., a model's performance on previously learned tasks decays upon learning new tasks. To mitigate catastrophic forgetting in CL, various methodologies and strategies have been proposed: 

%\vspace{-.1in}
\begin{list}{\labelitemi}{\leftmargin=1em}

\item {\em Regularization-Based Approaches:} Regularization-based approaches add penalty terms to the training objective to discourage changes to the weights that are important for old tasks.
For example, elastic weight consolidation (EWC) \citep{kirkpatrick2017overcoming} regularizes the updates on weights based on their significance for previous tasks using the Fisher information matrix. \citet{aljundi2018memory} used an unsupervised and online approach to evaluate the model output's sensitivity to the inputs and penalize changes to important parameters.


\item {\em Replay-Based Approaches:} Replay-based approaches store and replay old tasks' data to help models retain knowledge. 
For example, generative replay \citep{shin2017continual} generates data samples from previous tasks.
In experience replay \citep{chaudhry2019tiny}, a model replays previous experiences in a controlled manner.
Techniques such as experience replay with replay buffer (ER-RB) \citep{lillicrap2019continuous} and generative adversarial networks (GANs) \citep{goodfellow2020generative} have also been developed to enhance the efficiency of these mechanisms.

\item {\em Orthogonal-Projection-Based Approaches:} 
To eliminate the need to store data of old tasks or tune the regularization parameter, researchers have proposed to learn the new tasks and update the model in the {\em orthogonal subspace} of the old tasks \citep{chaudhry2020continual}, which has demonstrated superior performance compared to other approaches \citep{saha2021gradient}.
State-of-the-art orthogonal-projection-based approaches include, e.g., \citet{lin2022beyond}, which first characterizes the task correlation to identify the positively correlated old tasks in a layer-wise manner, and then selectively modifies the learned model of the old tasks when learning the new task.
More recently, several new techniques, such as those proposed by \citet{yang2024introducing} and \citet{xu2024disentangled}, have been applied to orthogonal-projection-based approaches, yielding significant improvements in both forward and backward knowledge transfer.

\item {\em Prompt-Based Continual Learning Approaches:}
As large language models (LLMs) continue to be explored in greater depth, prompt-based continual learning methods, such as L2P \citep{wang2022learning}, DualPrompt \citep{wang2022dualprompt}, and HiDE \citep{wang2023hierarchical}, are gaining popularity. These approaches typically prepend task-specific prompts (e.g., learnable tokens or embeddings) to the input or internal activations. In such methods, the base model remains largely \textbf{frozen} or undergoes minimal updates, with learning primarily occurring in the prompt space. Each task is associated with a distinct prompt, enabling the model to adapt dynamically based on the given prompt.
In contrast, orthogonal-projection-based approaches (e.g., GPM \citep{saha2021gradient}, OWM \citep{zeng2019continual}, and A-GEM variants \citep{chaudhry2018efficient}) update the model weights directly. These techniques project gradients orthogonally to the subspaces corresponding to previously learned tasks, ensuring that the model parameters are \textbf{not frozen} and continue to evolve while preserving past knowledge.
\end{list}

%\smallskip
%\vspace{-.1in}
{\bf 2) Local Low-Rank Approximation:}
Due to the superior performance compared to other approaches, we focus on the orthogonal-projection-based approach for CL in this paper.
However, a key challenge of the orthogonal-projection-based CL approach stems from the need to compute the orthogonal subspace, which is highly expensive as the model size grows.
This motivates us to propose a local model space projection (LMSP) approach based on local low-rank approximation to lower the complexity of the orthogonal subspace computation.
Recent works, such as \citep{li2024unigrad}, have focused on improving continual learning (CL) efficiency by optimizing gradient directions and mitigating gradient conflicts during training. In contrast, our work primarily aims to reduce the computational cost of orthogonal-projection-based CL while maintaining competitive performance.
%In what follows, we provide an overview on local low-rank approximation.
The key idea of our proposed LMSP approach builds on the low-rank approximation (LRA) of matrices.
LRA techniques have been widely applied in the areas of matrix factorization \citep{billsus1998learning, mnih2007probabilistic, salakhutdinov2008bayesian, candes2010matrix}. 
The basic idea of these existing works is to represent a given matrix by a product of lower-rank matrices that capture the essential structure of the original matrix. 

Local low-rank approximation (LLRA) extends LRA to preserve low-rank structures in localized regions of matrices. 
LLRA has been applied in various applications, such as recommendation \citep{beutel2017beyond, sarwar2002recommender, christakopoulou2018local} and collaborative filtering \citep{george2005scalable, lee2014local, koren2008factorization}.
For example, \citet{lee2013local} proposed a local low-rank matrix approximation (LLORMA) method, which finds anchor points of the matrix and estimates local low-rank matrices in the neighborhood surrounding each anchor point. 
Then, a weighted sum of the local matrices is used to approximate the original matrix, where each weight is the kernel similarity between the target entry and the corresponding anchor point.
\citet{lee2014local} later used this method in collaborative filtering to estimate the user-item rating matrix with a weighted combination of local matrices.
To our knowledge, our work is the first to leverage the local low-rank approximation approach for CL.

\begin{comment}
\section{Problem Formulation and Preliminaries}
\label{sec:problem_statement}

In this section, we first formally state the problem formulation of CL and then introduce the standard full-rank orthogonal-projection-based approach for CL and its fundamental computational complexity challenge.

\smallskip
{\bf 1) Continual Learning:}
Continual learning (CL) considers a set of tasks $\mathbb{T} = \{t\}_{t=0}^T$ that arrive sequentially at the learner.
Each task $t$ is associated with a dataset $\mathcal{D}_{t} = \{(\x_{t, i}, \y_{t, i})\}_{i=1}^{N^t}$ that contains $N^t$ samples, where $\x_{t, i}$ and $\y_{t, i}$ are the $i$-th data point and its label in task $t$.
In this paper, we consider a fixed-capacity neural network with $L$ layers, with weights being denoted 
%\jin{we never use this mathbb W later} 
as $\{\W^l\}_{l=1}^{L}$, where $\W^l$ is the layer-wise weight for the $l$-th layer.
We let $\x_{t,i}^{l}$ denote the input of layer $l$, with $\x_{t,i}^{1} = \x_{t,i}$.
The output of layer $l$, which is also the input of layer $l+1$, is computed as 
$$\x_{t,i}^{l+1} = f(\W^l, \x_{t,i}^{l}),$$ 
where $f(\cdot)$ denotes the processing at layer $l$.
In this paper, we focus on the CL setting where we only have access to the dataset of the new task $\mathcal{D}_{t}$ and {\em no} data samples of old tasks $j \in [0, t-1]$ are available. 
We denote the loss function as $\mathcal{L}(\W, \mathcal{D}_t) = \mathcal{L}_t(\W)$, where $\W$ denotes the weights for the entire neural network model. 
To learn task $t$ in CL, we have the weights $\W_{t-1}$ after learning for task $t-1$.
The purpose of CL is to learn the new task $t$ based on the weights in $\W_{t-1}$ and the new data $\mathcal{D}_{t}$.

\smallskip
{\bf 2) Full-rank Orthogonal-Projection-Based Approach for CL:}
To address the forgetting challenge in CL, there has been a recent line of works that propose model updating for the new task in the direction orthogonal to the subspace spanned by the old tasks' input.
As an illustration of this basic idea, let $D_j$ be the subspace spanned by the inputs of task $j$.
Then, the subspace spanned by the inputs of task 1's layer $l$ could be denoted as $D_1^l$.
The learned model for task 1 is denoted %\jin{we never use this mathbb W later} 
as $\{\W_{1}^{l}\}_{l}^{L}$.
To learn task 2, the current model $\W_{1}^{l}$ will be updated in a direction orthogonal to $D_1^l$.
Let $\Delta \W_1^l$ denote the model update after learning task 2.
It follows from the orthogonal direction that 
$\Delta \W_1^l \x_{1,i}^{l} = 0.$
Also, after learning task 2, the model is 
$\W_2^l = \W_1^l + \Delta \W_1^l.$
Thus, we have 
$$\W_2^l \x_{1,i}^l = (\W_1^l + \Delta \W_1^l)\x_{1,i}^{l} = \W_1^l \x_{1,i}^{l} + \Delta \W_1^l \x_{1,i}^{l} = \W_1^l \x_{1,i}^l,$$ 
which implies that there is {\em no} interference to task 1 after the learning of task 2, hence avoiding ``forgetting.''

\smallskip
{\bf 3) Full-rank Orthogonal-Projection-Based CL Approach with Backward Knowledge Transfer:}
%Based on Task Similarity:}
Although full-rank orthogonal-projection-based approaches can effectively address the forgetting problem, one of its key limitations is that forward and backward knowledge transfers are impossible due to the restriction of model updates only in the subspace orthogonal to the input space of old tasks.
To address this limitation, a trust region approach is proposed in \citep{lin2022trgp}, which is built upon the following definitions~\citep{lin2022trgp}:


\begin{definition}[Sufficient Projection~\citep{lin2022trgp}] \label{defn:SuffProj}
For any new task $t \in [1, T]$, we say that it has sufficient gradient projection on the input subspace of an old task $j \in [0, t-1]$ if for some $\lambda_1 \in (0, 1)$, it holds that
    $\|\mathrm{Proj}_{S_j}(\nabla \mathcal{L}_t(\W_{t-1}))\|_2 \ge \lambda^\prime_1 \|\nabla \mathcal{L}_t(\W_{t-1})\|_2$.
\end{definition}

%\vspace{-.1in}
Here, $\mathrm{Proj}_{S_j}(\A) = \S_j (\S_j)^{\top} \A$ denotes the projection onto the input subspace $D_j$ of task $j$, where $\S_j$ is the basis of $D_j$.
The definition of sufficient projection implies that tasks $t$ and $j$ have sufficient common basis between their input subspaces and hence strong correlation.

While the sufficient projection condition above suggests a strong correlation between tasks $t$ and $j$, a stronger condition suggesting a positive correlation between tasks is also introduced in \citep{lin2022beyond} as follows:

\begin{definition}[Positive Correlation~\citep{lin2022trgp}] \label{defn:PosCorr}
A new task $t \in \{1,\ldots,T\}$ has a positive correlation with an old task $j \in \{0,\ldots,t-1\}$ if for some $\lambda^\prime_2 \in (0,1)$, it holds that
$\langle \nabla \mathcal{L}_j(\W_j), \nabla \mathcal{L}_t(\W_{t-1}) \rangle \geq \lambda^\prime_2 \| \nabla \mathcal{L}_j(\W_j)\|_2 \| \nabla \mathcal{L}_t(\W_{t-1}) \|_2$.
\end{definition}

Based on Definitions~\ref{defn:SuffProj} and \ref{defn:PosCorr}, 
the model space can be partitioned into three CL regimes, based on which three layer-wise update rules can be applied to either mitigate forgetting or enable knowledge transfer (cf. the continual learning method with forward/backward knowledge (CUBER) \citep{lin2022beyond}):
%\vspace{-.1in}
\begin{list}{\labelitemi}{\leftmargin=1em}
\item {\em Regime~1 (Forget Mitigation):} Due to the weak correlation between tasks in this regime, where $\|\mathrm{Proj}_{D^{l}_{j}}(\nabla \mathcal{L}_t(\W^{l}_{t-1}))\|_2 \leq \lambda^\prime_1 \| \nabla \mathcal{L}_t (\W^{l}_{t-1})\|_2$, the model is updated based on orthogonal projection to avoid catastrophic forgetting as follows:
\begin{align}\label{global_rule1}
\nabla \mathcal{L}_t (\W^l) \leftarrow \nabla \mathcal{L}_t (\W^l) - \mathrm{Proj}_{D^l_j} (\nabla \mathcal{L}(\W^l)).
\end{align}

\item {\em Regime~2 (Forward Knowledge Transfer):} A task $j$'s layer $l$ falls into Regime~2 if sufficient projection holds while positive correlation is not satisfied.
Due to the potential ``negative correlation'' in this regime,  forgetting still needs to be avoided by using orthogonal projection.
However, thanks to the correlation between tasks, one can facilitate forward knowledge transfer.
Putting both ideas together, the update rule in Regime~2 can be written as:
\begin{align}\label{global_rule2}
\nabla \mathcal{L}_t (\W^l) &\leftarrow \nabla \mathcal{L}_t (\W^l) -  \mathrm{Proj}_{D^l_j} (\nabla \mathcal{L}(\W^l)), \\
\Q^{l}_{j,t} &\leftarrow \Q^{l}_{j,t} - \beta \nabla_{\Q} \mathcal{L}_t (\W^l  - \mathrm{Proj}_{D^{l}_{j}}(\W^l) \nonumber \\
& \quad\quad + \W^l \S^l_j \Q^{l}_{j,t} (\S^l_j)^{\top}), \nonumber
\end{align}
where $\S^l_j$ is the basis matrix for subspace $D^l_j$ and $\Q^{l}_{j,t}$ is a diagonal scaling matrix to facilitate forward knowledge transfer (see \citep{lin2022beyond, lin2022trgp} for details).

\item {\em Regime~3 (Backward Knowledge Transfer):} A task $j$'s layer $l$ falls into Regime~3 if both sufficient projection and positive correlation conditions are satisfied.
Due to the positive correlation between tasks, one can use a simple gradient-descent-type rule (with $\lambda$-regularization) to perform model update as follows, which also helps simultaneously improve the performances of old tasks (i.e., backward knowledge transfer):
\begin{align*}
\W^l \leftarrow \W^l - \alpha \nabla [\mathcal{L}_t(\W^l) + \theta \| \mathrm{Proj}_{D^l_j} (\W^l - \W^{l}_{t-1}) \|].
\end{align*}
\end{list}

{\bf 4) Limitations and Challenges of Full-rank Orthogonal-Projection-Based Approaches:}
Although the aforementioned full-rank orthogonal-projection-based approaches (with forward-backward knowledge transfer) could effectively avoid forgetting without needing data from any old tasks, a {\em major challenge}
in such approaches stems from checking the sufficient projection condition, which typically requires performing singular value decomposition (SVD) operations.
It is well-known that SVD has an $O(n^3)$ computational complexity, which increases rapidly as $d$ increases.
Thus, as the model size increases (e.g., in large-scale transformer models), computing SVD is expensive or even infeasible.
This limitation motivates us to develop efficient methods with low computation complexity for orthogonal-projection-based approaches in the subsequent section.
\end{comment}

\section{A Local Model Space Projection Approach}\label{Section: algorithm}

In this section, we first introduce the basic idea of local representation and task subspace construction in Section~\ref{subsec:LocalRep}, based on which we define task similarity with local projection in Section~\ref{subsec:TaskSim}.
These key notions allow us to further propose update rules based on local representations and task subspaces in Section~\ref{subsec:lmsp}.
Lastly, we conduct a theoretical performance analysis of our proposed LMSP-based orthogonal projection approach in Section~\ref{subsec:thms}.

\subsection{Local Representation and Task Space Construction} \label{subsec:LocalRep}

{\em 1) The Basic Idea:}
As mentioned in Section~\ref{intro}, to lower the SVD computational costs in full-rank orthogonal-projection-based CL approaches, the basic idea of our local model space projection (LMSP) approach is based on a ``divide and orthogonalize'' principle.
Our LMSP approach is built upon the following key notion of local model representation.

Given $N^j$ samples in an old task $j \in [0, t-1]$, we construct a representation matrix $\R_{j}^{l} = [\r_{j, 1}^{l},\ldots,\r_{j, N^j}^{l}] \in \mathbb{R}^{M \times N^j}$ for layer $l$, where $M$ is the representation dimension and each $\r_{j, i}^{l} \in \mathbb{R}^{M}$, $i = 1,2,\ldots,N^j$, is the layer-$l$ representation obtained by forwarding the sample $\x_{j, i}$ through the model.
Instead of directly applying SVD to the representation matrix $\R_{j}^{l}$, we approximate the matrix by a set of low-rank matrices defined by a set of anchor points. 

In a similar vein to \citep{lee2013local}, we define a {\em smoothing kernel} $K_h(s_1, s_2)$ with bandwidth $h$, where $s_1, s_2 \in [M] \times [N^j]$ are entry locations in the representation matrix $\R_{j}^{l}$.
For convenience, for a fixed entry location $(a,b)$, we collect the kernel values into a kernel matrix $\K_h^{(a,b)}$,
whose $(i,j)$-th entry is $K_h((a, b), (i, j))$.
Simply speaking, the smoothing kernel is a non-negative symmetric unimodal function parameterized by the bandwidth parameter $h > 0$. 
Generally, the larger the value of $h$, the wider the spread of the kernel \citep{wand1994kernel}.
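
As a concrete illustration (a sketch for exposition only; the distance $\rho$ below is an assumption that is not specified in the text above), a Gaussian smoothing kernel can be written as
\begin{align*}
K_h(s_1, s_2) \propto \exp\left(-\frac{\rho(s_1, s_2)^2}{2h^2}\right),
\end{align*}
where $\rho(s_1, s_2) \geq 0$ is some distance between the two entry locations.
For a fixed distance $\rho(s_1, s_2) = 1$, the unnormalized kernel value is $e^{-2} \approx 0.135$ for $h = 0.5$ and $e^{-1/8} \approx 0.882$ for $h = 2$, so a larger bandwidth assigns noticeably more weight to distant entries.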

To obtain a set of local representation matrices, we first sample $m$ {\em ``anchor points''} from the global representation matrix $\R_{j}^{l}$, which are denoted as $\{s_q \triangleq (i_q, j_q) \}_{q=1}^m$, where $(i_q, j_q) \in [M] \times [N^j]$ is the entry location of the $q$-th anchor point.
In \citep{wand1994kernel,lee2013local}, it has been shown that the global representation matrix $\R_{j}^{l}$ has a locally low-rank structure and thus can be approximated by the local representation matrices $\{\hat{\R}_{j}^{l}(s_q) \}_{q=1}^{m}$ corresponding to these anchor points via Nadaraya-Watson regression (note that $\hat{\R}_{j}^{l}(s_q)$ depends on the specific anchor point $s_q$), where $s \in [M] \times [N^j]$ denotes the entry location at which the approximation is evaluated:
\begin{align}\label{nonpa}
    \R_{j}^{l} \approx \hat{\hat{\R}}_{j}^{l} \triangleq \sum_{q=1}^m\frac{K_h(s_q, s)}{\sum_{p=1}^m K_h(s_p, s)} \hat{\R}_{j}^{l}(s_q).
\end{align}
To obtain the local representation matrices $\{\hat{\R}_{j}^{l}(s_q) \}_{q=1}^{m}$ in Eq.~\eqref{nonpa}, we adopt a product form for the general kernel function $K_h(s_1, s_2) = K_h((a, b), (c, d)) = K_{h_1}(a, c)K^\prime_{h_2}(b, d)$, where $s_1, s_2 \in [M] \times [N^j]$ and $K, K^\prime$ are kernels on the spaces $[M]$ and $[N^j]$, respectively. 
We summarize several popular smoothing kernels in Appendix \ref{T3}.
In this paper, we use the Gaussian kernel for both $K, K^\prime$ (we will conduct ablation studies on the choice of smoothing kernels later in Section \ref{Section: experiment}).
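
To make the weighting in Eq.~\eqref{nonpa} concrete, consider a hypothetical case (the numbers are chosen purely for illustration) with $m = 2$ anchor points and an entry location $s = (a, b)$ for which the product kernel gives $K_h(s_1, s) = K_{h_1}(i_1, a) K^\prime_{h_2}(j_1, b) = 0.6$ and $K_h(s_2, s) = K_{h_1}(i_2, a) K^\prime_{h_2}(j_2, b) = 0.2$.
Reading Eq.~\eqref{nonpa} entrywise, the normalized weights are
\begin{align*}
\frac{0.6}{0.6 + 0.2} = 0.75 \quad \text{and} \quad \frac{0.2}{0.6 + 0.2} = 0.25,
\end{align*}
so the approximation at $s$ is $0.75\,[\hat{\R}_{j}^{l}(s_1)]_{s} + 0.25\,[\hat{\R}_{j}^{l}(s_2)]_{s}$.
Since the kernel is non-negative, the weights are non-negative and sum to one whenever at least one kernel value is positive, i.e., each approximated entry is a convex combination of the corresponding entries of the local representation matrices.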

In the literature, there are two widely used ways to choose the anchor points $\{s_q \triangleq (i_q, j_q) \}_{q=1}^m$: 
1) sample uniformly at random from the representation matrix in $[M] \times [N^j]$; 
and 2) use $k$-means or other clustering methods to pre-cluster the representation matrix and then use the cluster centers as the anchor points.
Even though using pre-clustering to find centroids as anchor points may provide a more distinct and diverse representation, as also supported by prior work such as \citep{zhang2017local}, our numerical studies later show that the improvements are marginal.
More specifically, we found that as long as the randomly chosen anchor points are spread relatively uniformly, the empirical difference between the two selection methods is not significant.
Since the task bases are extracted at every layer and layer-wise clustering (e.g., $k$-means) would introduce substantial additional computational cost, we use the random sampling strategy in our experiments for simplicity.
%In our numerical studies, we do not observe a significant difference between these two methods. 
%For simplicity, we use the random sample strategy in our experiments.

Next, with local representations, we will show how the local model spaces are constructed for task $j$ at layer $l$.
For an old task $j \in [0, t-1]$, to obtain the basis $\S_{j}^{l}$ at layer $l$, traditional methods \citep{saha2021gradient,lin2022trgp} adopted the standard singular value decomposition (SVD) for the representation matrix of each layer, which incurs a high computation cost of $\mathcal{O}(M N^j \min(M, N^j)) = \mathcal{O}(n^3)$. 
In contrast, by using a low-rank structure for each local model, the computation can be significantly reduced. 
Specifically, we first obtain the local decomposed matrices $\A^{(q)}$ and $\B^{(q)}$ for each anchor point $s_q$ by minimizing the following global least-squares loss in Eq.~\eqref{nonpa1}:
\begin{align}
    &\{(\A^{(q)}, \B^{(q)})\}^m_{q=1}:=\nonumber \\
    \label{nonpa1} 
    &\argmin_{\A^{(q)}, \B^{(q)}} \sum_{x,y\in \Omega} \left[\sum_{q=1}^m (\frac{K_h^{(q)} \odot [\A^{(q)}\B^{{(q)}^{\top}}]}{\sum_{p=1}^m K_h^{(p)}}  - \R_{j}^{l})^2\right]_{x,y} \nonumber \\
    & \quad \quad \quad \quad + \sum_{q=1}^m[ \lambda_A^{(q)}\|\A^{(q)}\|_{F}^2 + \lambda_B^{(q)} \|\B^{(q)}\|_{F}^2],
\end{align}
where $\Omega$ is the observed set of indices of the matrices, $K_h^{(q)} = K_h^{s_q} = K_h^{(i_q, j_q)}$ is the kernel matrix whose $(a, b)$-th entry is 
\[K_h((i_q, j_q), (a, b)) = K_{h_1}(i_q, a)K^\prime_{h_2}(j_q, b),\]
and $\odot$ is the Hadamard product. 
We also add $\ell_2$ regularization as is standard in conventional SVD.
%
Similar to \citep{lee2013local}, we can obtain $(\A^{(q)}, \B^{(q)})$ in a parallel fashion as follows:
\begin{align}\label{eq:obj}
    (\A^{(q)}, \B^{(q)}):&=
    \argmin_{\A, \B} \sum_{x,y\in \Omega}[K_h^{(q)} \odot ([\A \B^{\top}] - \R_{j}^{l})^2]_{x,y} \nonumber \\
    &+ \lambda_A\| \A \|_{F}^2 + \lambda_B \|\B \|_{F}^2.
\end{align}
Being a variant of low-rank matrix completion, this problem can be solved efficiently via various methods, including AltMin \citep{jain2013low,hastie2015matrix}, singular value projection \citep{netrapalli2014non,jain2010guaranteed}, Riemannian GD \citep{wei2016guarantees}, and ScaledGD \citep{tong2021accelerating,xu2023power}; see \citep{chen2018harnessing,chi2019nonconvex} for recent overviews.
In this paper, we use the AltMin method to find the optimizer and obtain the basis for each local model. 

%\smallskip
{\em 2) Computational Complexity Analysis:}
Denote the rank of each local model as $r \ll \min(M, N^j)$, so that $\A \in \mathbb{R}^{M \times r}$ and $\B \in \mathbb{R}^{N^j \times r}$. 
We then apply QR decomposition to obtain $\A = \hat{\Ut} \boldsymbol{\Psi}_A$ and $\B = \hat{\V}\boldsymbol{\Psi}_B$, where $\boldsymbol{\Psi}_A, \boldsymbol{\Psi}_B \in \mathbb{R}^{r \times r}$, and perform SVD on the resulting $r\times r$ matrix: $\boldsymbol{\Psi}_A \boldsymbol{\Psi}_B^\top = \Ut_\Psi\boldsymbol{\Sigma}\V_\Psi^\top$. The final basis for local model space $q$ is then constructed as $\S^{l, (q)}_{j} \triangleq \hat{\Ut}^{l, (q)}_{j} {\Ut}_{\Psi, j}^{l, (q)} \in \mathbb{R}^{M \times r}$ for $q = 1, \ldots, m$.
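To see why this construction yields a valid basis, consider the following identities, where the indices $l$, $j$, and $q$ are dropped for brevity and $\hat{\Ut}$ and $\hat{\V}$ are assumed to have orthonormal columns (as returned by a standard thin QR decomposition):
\begin{align*}
\A\B^\top &= \hat{\Ut} \boldsymbol{\Psi}_A \boldsymbol{\Psi}_B^\top \hat{\V}^\top = (\hat{\Ut}\Ut_\Psi) \boldsymbol{\Sigma} (\hat{\V}\V_\Psi)^\top, \\
(\hat{\Ut}\Ut_\Psi)^\top (\hat{\Ut}\Ut_\Psi) &= \Ut_\Psi^\top \hat{\Ut}^\top \hat{\Ut} \Ut_\Psi = \mathbf{I}_r.
\end{align*}
That is, $\hat{\Ut}\Ut_\Psi$ collects the left singular vectors of the rank-$r$ matrix $\A\B^\top$, while only $M \times r$, $N^j \times r$, and $r \times r$ matrices are ever factorized.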

Then, for a new task $t$, we treat all $m$ local model spaces as $m$ old tasks.
As a result, we have a total of $tm$ old tasks as candidates for new task $t$ to find the top-$k$ correlated ones. 
Since the AltMin algorithm has a complexity of $\mathcal{O}(M N^j r) = \mathcal{O}(n^2)$ for each local model, the total complexity is reduced to $\mathcal{O}(n^2 m) = \mathcal{O}(n^2)$, as the total number of anchor points satisfies $m \ll \min(M, N^j)$. 
Thus, the computation cost of LMSP is significantly lower than the $\mathcal{O}(n^3)$ cost of the standard SVD.
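For a rough sense of scale, consider the purely hypothetical values $M = N^j = 1000$, $r = 25$, and $m = 10$ (chosen here only for illustration and not taken from our experiments), and count operations up to constant factors, taking the complexities stated above at face value:
\begin{align*}
\text{full SVD:} \quad & M N^j \min(M, N^j) = 10^{9}, \\
\text{one local model:} \quad & M N^j r = 2.5 \times 10^{7}, \\
\text{all $m$ local models:} \quad & m M N^j r = 2.5 \times 10^{8}.
\end{align*}
Under these assumptions, the local construction is cheaper by a factor of $\min(M, N^j)/(mr) = 4$ when the local models are processed sequentially, and by a factor of $\min(M, N^j)/r = 40$ when they are processed in parallel. The subsequent QR decompositions and the $r \times r$ SVD cost only $\mathcal{O}((M + N^j) r^2 + r^3)$ per local model, which is of lower order.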

\subsection{Task Similarity with Local Projection} \label{subsec:TaskSim}
With the local representations in Section~\ref{subsec:LocalRep}, we are now in a position to introduce the following definitions on task gradients to formally characterize the task similarity.
%Toward this end, we need the following definitions that generalize Definitions~\ref{defn:SuffProj} and \ref{defn:PosCorr} to local settings:
Toward this end, we introduce the following definitions, which generalize Definition 1 and Definition 2 from \citep{lin2022beyond} to local settings.

\begin{definition}[Local Sufficient Projection] \label{def:LocalSuffCond}
For any new task $t \in [1, T]$, we say that it has local sufficient gradient projection on the local subspace $q \in [1, m]$ of old task $j \in [0, t-1]$ if for some $\lambda_1 \in (0, 1)$:
%\begin{align}
    $\|\mathrm{Proj}_{K_h^{(q)}D_j}(\nabla \mathcal{L}_t(\W_{t-1}))\|_2 \ge \lambda_1 \|\nabla \mathcal{L}_t(\W_{t-1})\|_2$.
%\end{align}
\end{definition}

\begin{definition}[Local Positive Correlation] \label{defn:LocalPosCorr}
For any new task $t \in [1, T]$, we say that it has local positive correlation with the local subspace $q \in [1, m]$ of old task $j \in [0, t-1]$ if for some $\lambda_2 \in (0, 1)$:
    $\langle \nabla \mathcal{L}_j^{(q)}(\W_j^{(q)}), \nabla \mathcal{L}_t(\W_{t-1})\rangle \ge \lambda_2 \|\nabla \mathcal{L}_j^{(q)}(\W_j^{(q)})\|_2 \|\nabla \mathcal{L}_t(\W_{t-1})\|_2$.
\end{definition}

Here, for any matrix $\A$, $\mathrm{Proj}_{K_h^{(q)}D_j}(\A) \triangleq \S^{(q)}_{j} {\S^{(q)}_{j}}^\top \A$ defines the projection on the input local model space for anchor point $q$ of old task $j$, and ${\S^{(q)}_{j}}$ is the basis for this local model space. 
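As a brief sanity check on the admissible range of $\lambda_1$, assume that $\S_j^{(q)}$ has orthonormal columns (as is the case when it is obtained from the QR and SVD steps in Section~\ref{subsec:LocalRep}) and measure matrices in the Frobenius norm (an assumption made here for concreteness). Then, for any matrix $\A$,
\begin{align*}
\|\mathrm{Proj}_{K_h^{(q)}D_j}(\A)\|_F^2 &= \mathrm{tr}\big(\A^\top \S_j^{(q)} {\S_j^{(q)}}^\top \S_j^{(q)} {\S_j^{(q)}}^\top \A\big) \\
&= \|{\S_j^{(q)}}^\top \A\|_F^2 \le \|\A\|_F^2.
\end{align*}
Hence, the ratio between the norm of the projected gradient and the norm of the gradient always lies in $[0, 1]$: it equals $1$ when the gradient lies entirely in the local model space and $0$ when the gradient is orthogonal to it, which is why thresholds in $(0, 1)$ are meaningful in the definitions above.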

%Compared to Definition~\ref{defn:SuffProj}, the projection space in Definition~\ref{def:LocalSuffCond} is changed to the $q$-th local model basis rather than the global basis for task $j$. 
Compared to the global sufficient projection definition (Definition 1 in \citep{lin2022beyond}), the projection space in Definition~\ref{def:LocalSuffCond} is changed to the $q$-th local model basis rather than the global basis for task $j$. 

Definition~\ref{def:LocalSuffCond} implies that task $t$ and the $q$-th local model of task $j$ share a sufficiently common basis and are strongly correlated, since the gradient lies in the span of the input \citep{zhang2021understanding}. 
%Also, similar to Definition~\ref{defn:PosCorr}, Definition~\ref{defn:LocalPosCorr} goes one step further to characterize the task similarity. 
Also, similar to the positive correlation definition (Definition 2 in \citep{lin2022beyond}), Definition~\ref{defn:LocalPosCorr} goes one step further to characterize the task similarity. 

In addition to the local sufficient projection and local positive correlation conditions, we introduce a {\em new} concept, termed {\em ``local relative orthogonality''}, specifically tailored for our LMSP-based method, defined as follows:

\begin{definition}[Local Relative Orthogonality] \label{defn:LocalRelOrth}
For any new task $t \in [1, T]$, we say that it is more locally relatively orthogonal to the local subspace $q \in [1, m]$ of old task $j \in [0, t-1]$ than to the global subspace of old task $j$ if, for some $\lambda_3 \in (0, 1)$, the following condition holds:
\begin{align*}
&\|\mathrm{Proj}_{K_h^{(q)}D_j}(\nabla \mathcal{L}_t(\W_{t-1}))\|_2 = \\
&\lambda_3 \| \mathrm{Proj}_{D_j}(\nabla \mathcal{L}_t(\W_{t-1}))\|_2 \le \|\mathrm{Proj}_{D_j}(\nabla \mathcal{L}_t(\W_{t-1}))\|_2.
\end{align*}
\end{definition}
The local relative orthogonality means that the input of the $q$-th local model space for old task $j$ is more orthogonal to the new task $t$ than the global one, which indicates that updating the model along the $\nabla \mathcal{L}_t(\W)$ direction would introduce {\em less interference} to old task $j$, thus mitigating the forgetting problem.
Note that Definitions~\ref{def:LocalSuffCond}--\ref{defn:LocalRelOrth} characterize the similarity based on the old model weights $\W_{t-1}$; hence, they allow task similarity to be detected before learning the new task $t$.



\subsection{Low-Complexity Continual Learning with Local Model Space Projection} \label{subsec:lmsp}

With the local representations and the associated task similarity, we propose the following LMSP-based orthogonal projection approach, 
%in the spirit of CUBER (cf. Section~\ref{sec:problem_statement}), 
which aims to avoid forgetting while enabling backward knowledge transfer.
Toward this end, based on Definitions~\ref{def:LocalSuffCond} and \ref{defn:LocalPosCorr}, we establish the following regimes and update rules, which correspond to the global settings described in \citep{lin2022beyond}.

%\smallskip
\textbf{Regime~1}~(Forgetting Mitigation): For a new task $t$'s layer $l$, if $\|\mathrm{Proj}_{K_h^{(q)}D_j^l}(\nabla \mathcal{L}_t(\W_{t-1}^l))\|_2 < \lambda_1 \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2$, we say that the $q$-th local model of old task $j$ falls into Regime~1. 

Note that in this case, since task $t$ and task $j^{(q)}$ are {\em relatively orthogonal,} we update the model in the direction of orthogonal projection to avoid forgetting:
\begin{align}\label{local_rule1}
\nabla \mathcal{L}_t(\W^l) \leftarrow \nabla \mathcal{L}_t(\W^l) - \mathrm{Proj}_{K_h^{(q)} D_j^l}(\nabla \mathcal{L}_t(\W^l)).
\end{align}

%\smallskip
\textbf{Regime~2}~(Forward Knowledge Transfer): For a new task $t$'s layer $l$, if it holds that 
\begin{align*}
&\|\mathrm{Proj}_{K_h^{(q)}D_j^l}(\nabla \mathcal{L}_t(\W_{t-1}^l))\|_2 \ge \lambda_1 \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2, \\
&\langle \nabla \mathcal{L}_j^{(q)}(\W_j^{l,(q)}), \nabla \mathcal{L}_t(\W_{t-1}^l)\rangle < \\ 
& \quad \quad \quad \quad \lambda_2 \|\nabla \mathcal{L}_j^{(q)}(\W_j^{l,(q)})\|_2 \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2,
\end{align*}
we say the $q$-th local model of old task $j$ falls into Regime~2.

In this case, since task $t$ and task $j^{(q)}$ are strongly correlated in terms of gradient norm projection but negatively correlated in gradient direction, we still update the model along the orthogonal projection and use a scaling matrix $\Q$ to facilitate forward knowledge transfer, similar to the idea in \citep{lin2022trgp}:
\begin{align}\label{local_rule2}
&\nabla \mathcal{L}_t(\W^l) \leftarrow \nabla \mathcal{L}_t(\W^l) - \mathrm{Proj}_{K_h^{(q)}D_j^l}(\nabla \mathcal{L}_t(\W^l)), \\
&\Q_{j, t}^{l, (q)} \leftarrow \Q_{j, t}^{l, (q)} - \beta \nabla_{\Q} \mathcal{L}_t(\W^l - \mathrm{Proj}_{K_h^{(q)}D_j^l}(\W^l) \nonumber \\
& \quad\quad\quad\quad - \W^l{\S^{l, (q)}_{j}} \Q_{j, t}^{l, (q)}{\S^{l, (q)}_{j}}^\top). \nonumber
\end{align}

%\smallskip
\textbf{Regime~3}~(Backward Knowledge Transfer): For a new task $t$'s layer $l$, if it holds that 
\begin{align*}
&\|\mathrm{Proj}_{K_h^{(q)}D_j^l}(\nabla \mathcal{L}_t(\W_{t-1}^l))\|_2 \ge \lambda_1 \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2, \\
&\langle \nabla \mathcal{L}_j^{(q)}(\W_j^{l,(q)}), \nabla \mathcal{L}_t(\W_{t-1}^l)\rangle \ge \\
& \quad\quad\quad\quad \lambda_2 \|\nabla \mathcal{L}_j^{(q)}(\W_j^{l,(q)})\|_2 \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2,
\end{align*}
we say the $q$-th local model of old task $j$ falls into Regime~3. 
 
In this case, since task $t$ and task $j^{(q)}$ are positively correlated in both norm and direction, updating the model directly along $\nabla \mathcal{L}_t(\W^l)$ not only leads to a better model for continual learning, but also improves the performance of old task $j$. 
Since the weight projection is frozen, i.e., $\mathrm{Proj}_{K_h^{(q)} D_j^l}(\W^l_{t-1}) = \mathrm{Proj}_{K_h^{(q)} D_j^l}(\W^l_j)$, we update the model as follows: 
\begin{align*}
\W^l \!\leftarrow\! \W^l \!-\! \alpha\nabla[\mathcal{L}_t(\W^l) \!+\! \theta\|\mathrm{Proj}_{K_h^{(q)}D_j^l}(\W^l \!-\! \W^l_{t-1})\|].
\end{align*}
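For reference, the three regimes above can be summarized compactly. Using the shorthand (introduced here only for readability) $\rho \triangleq \|\mathrm{Proj}_{K_h^{(q)}D_j^l}(\nabla \mathcal{L}_t(\W_{t-1}^l))\|_2 / \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2$ and $c \triangleq \langle \nabla \mathcal{L}_j^{(q)}(\W_j^{l,(q)}), \nabla \mathcal{L}_t(\W_{t-1}^l)\rangle / (\|\nabla \mathcal{L}_j^{(q)}(\W_j^{l,(q)})\|_2 \|\nabla \mathcal{L}_t(\W_{t-1}^l)\|_2)$, and assuming that both gradients are nonzero so that $\rho$ and $c$ are well defined, the $q$-th local model of old task $j$ is assigned as
\begin{align*}
j^{(q)} \in
\begin{cases}
\mathrm{Reg}^l_{t, 1}, & \rho < \lambda_1, \\
\mathrm{Reg}^l_{t, 2}, & \rho \ge \lambda_1 \text{ and } c < \lambda_2, \\
\mathrm{Reg}^l_{t, 3}, & \rho \ge \lambda_1 \text{ and } c \ge \lambda_2.
\end{cases}
\end{align*}
The three cases are mutually exclusive and exhaustive, so each local model of each old task falls into exactly one regime at each layer.
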
In summary, the optimization problem for learning a new task $t$ can be written as follows:
%\begin{small}
\begin{align}\label{obj}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!  \min_{\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \W, \{\Q_{j, t}^{l, (q)}\}_{l, j^{(q)}\in \mathrm{Reg}^l_{t, 2} \bigcup \mathrm{Reg}^l_{t, 3}}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!  & \mathcal{L}_t(\{\boldsymbol{\tilde{\W}}^l\}_l) + \nonumber\\
&\theta \sum_l \!\!\!\! \sum_{j^{(q)}\in \mathrm{Reg}^l_{t, 3}}  \!\!\!\!\!\!\!\! \|\mathrm{Proj}_{K_h^{(q)}D_j^l}\!(\W^l \!\!-\!\! \W^l_{t-1})\|, \\
    s.t. \,\, \boldsymbol{\tilde{\W}}^l =& \W^l + \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \sum_{j^{(q)}\in \mathrm{Reg}^l_{t, 2} \bigcup j^{(q)}\in \mathrm{Reg}^l_{t, 3}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! [\W^l \S^{l, (q)}_{j} \Q_{j, t}^{l, (q)}{\S^{l,(q)}_{j}}^\top \!\!\!\!- \label{constr1} \\
    &\quad \quad \mathrm{Proj}_{K_h^{(q)}D_j^l}(\W^l)], \nonumber \\
    \nabla \mathcal{L}_t(\W^l) =& \nabla \mathcal{L}_t(\W^l) - \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \sum_{j^{(q)}\in \mathrm{Reg}^l_{t, 1} \bigcup j^{(q)}\in \mathrm{Reg}^l_{t, 2}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \mathrm{Proj}_{K_h^{(q)}D_j^l}(\nabla \mathcal{L}_t(\W^l)). \nonumber
\end{align}
%\end{small}
In simple terms, the optimization problem in Eqs.~(\ref{obj}--\ref{constr1}) can be interpreted as follows.
First, since task similarity is calculated before learning the new task $t$, we can determine the regimes of the different local model spaces of old task $j$ and then construct the task $j^{(q)}$, which is the old task $j$ projected onto the local model space corresponding to anchor point $s_q$. Next, we conservatively update the model for task $j^{(q)}$ in Regime 3 while using orthogonal projection to preserve the knowledge for the rest (cf. the objective function in \eqref{obj}). 
The scaled weight projection is used for old tasks in both Regime 2 and Regime 3 to facilitate {\em forward knowledge transfer} (cf. the constraint in \eqref{constr1}).
Note that one can always strike a good balance between adapting the model to the new task and not forgetting the knowledge of the learnt tasks by adjusting the regularization parameter $\theta$.
The overview of our LMSP-based efficient continual learning framework is described in Algorithm~\ref{alg} (see next page).

\begin{algorithm}[t!]
\caption{Efficient Continual Learning with Local Model Space Projection (LMSP).}\label{alg}
\begin{algorithmic}[1]
\STATE Input: task sequence $\mathbb{T} = \{t\}_{t=0}^T$;
\STATE Learn the first task using vanilla stochastic gradient descent;
\FOR{each old task $j$}
    \STATE Sample $m$ anchor points;
    \STATE Extract the basis $\S^{l, (q)}_{j}$ for each local model space $q$ using the learnt model $\W_j$;
\ENDFOR
\FOR{each new task $t$}
    \STATE Calculate gradient $\nabla \mathcal{L}_t(\W_{t-1})$;
    \STATE Evaluate the {\em local sufficient projection} and {\em local positive correlation} conditions for layer-wise correlation computation to determine its membership in $\mathrm{Reg}^l_{t, 1}$, $\mathrm{Reg}^l_{t, 2}$ or $\mathrm{Reg}^l_{t, 3}$;
    \FOR{$k=1,2,\ldots$}
        \STATE Update the model and scaling matrices by solving Eq.~\eqref{obj};
    \ENDFOR
\ENDFOR
\STATE Output: The learnt model $\W_t$, scaling matrices $\{\Q_{j, t}^{l, (q)}\}_{l, j^{(q)}\in \mathrm{Reg}^l_{t, 2} \bigcup \mathrm{Reg}^l_{t, 3}}$;
\end{algorithmic}
\end{algorithm}

\subsection{Theoretical Performance Analysis} \label{subsec:thms}

In this subsection, we will establish the convergence rate and backward knowledge transfer of our proposed LMSP-based orthogonal projection approach.
Without loss of generality, consider the scenario of learning two consecutive tasks 1 and 2.
Note that since \citep{lin2022beyond} has already conducted theoretical analysis for the vanilla GD-type update (cf. Rule~\#2 in \citep{lin2022beyond}), which is also applicable in our work, we will only focus on the major difference in our work, which lies in the analysis for the local low-rank and full-rank orthogonal-projection-based updates.

Specifically, let $\mathcal{F}(\W) = \mathcal{L}(\W, \mathcal{D}_1) + \mathcal{L}(\W, \mathcal{D}_2)$, $\boldsymbol{g}_1(\W) = \nabla_{\W} \mathcal{L}(\W, \mathcal{D}_1)$ and $\boldsymbol{g}_2(\W) = \nabla_{\W} \mathcal{L}(\W, \mathcal{D}_2)$. 
%Note that $\boldsymbol{\bar{g}}(\W^{(k)})=\boldsymbol{g}(\W^{(k)})-\mathrm{Proj}_{K_h^{(q)}D_j}(\boldsymbol{g}(\W^{(k)}))$ as the gradients for the local low-rank orthogonal-projection-based updates in Eq.~\eqref{local_rule1} as well as Eq.~\eqref{local_rule2}, and $\boldsymbol{\ddot{g}}(\W^{(k)})=\boldsymbol{g}(\W^{(k)})-\mathrm{Proj}_{D_j}(\boldsymbol{g}(\W^{(k)}))$ as the gradients for the full-rank orthogonal-projection-based updates in Eq.~\eqref{global_rule1} as well as Eq.~\eqref{global_rule2}.
We denote $\boldsymbol{\bar{g}}(\W^{(k)})=\boldsymbol{g}(\W^{(k)})-\mathrm{Proj}_{K_h^{(q)}D_j}(\boldsymbol{g}(\W^{(k)}))$ as the gradients for the local low-rank orthogonal-projection-based updates in Eq.~\eqref{local_rule1} as well as Eq.~\eqref{local_rule2}, and $\boldsymbol{\ddot{g}}(\W^{(k)})=\boldsymbol{g}(\W^{(k)})-\mathrm{Proj}_{D_j}(\boldsymbol{g}(\W^{(k)}))$ as the gradients for the full-rank orthogonal-projection-based updates under Regime 1 and Regime 2 in \citep{lin2022beyond}.
We let $k \in [0, K-1]$ denote the step index and use $\W_1$ to denote the model parameters for task 1, with $\W_1 = \W^{(0)}$ serving as the initialization of the new task model weights.
We first state our major convergence rate result for the local low-rank orthogonal-projection-based update as follows:

\begin{theorem}\label{theorem_1}
Suppose the loss function $\mathcal{L}$ is $B$-Lipschitz and $\frac{H}{2}$-smooth. Let $\alpha \le \min \{\frac{1}{H}, \frac{\gamma \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{HBK}\}$ and $\lambda_1 \ge \\ \sqrt{1- 2\frac{2\|\boldsymbol{\bar{g}}_2(\W^{(0)})\|-\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}{\gamma^2 \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}}$ for some $\gamma \in (0, 1)$. Then, the following results hold:

%\smallskip
(1) if $\mathcal{L}$ is convex, the local low-rank orthogonal-projection-based update in Regimes 1 and 2 for task 2 converges to the optimal model $\W^{\star} = \arg \min \mathcal{F}(\W)$;

%\smallskip
(2) if $\mathcal{L}$ is non-convex, the local low-rank orthogonal-projection-based update in Regimes 1 and 2 for task 2 converges to a first-order stationary point:
\begin{multline*}
\min_k\|\nabla \mathcal{F}(\W^{(k)})\|^2 \le  \frac{2}{\alpha K}\sum_{k=0}^{K-1}[\mathcal{F}(\W^{(k)}) - \mathcal{F}(\W^\star)] + \\
\frac{[2\!+\!\gamma^2(5\!-\!\lambda_1^2)]}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 \!+\!4 \sum_{i=1}^{2}\|\boldsymbol{g}_i(\W^{(0)})\|^2.
\end{multline*}

\end{theorem}
Theorem \ref{theorem_1} characterizes the convergence of the joint objective function $\mathcal{F}(\W)$ when updating the model with local low-rank orthogonal-projection-based updates %\textbf{(Rule \#1)} 
in the convex setting, as well as the convergence to a first-order stationary point in the non-convex setting, when the $q$-th local model of task 1 and task 2 satisfy the local sufficient projection condition with a certain $\lambda_1$. 
Hence, it benefits the joint learning of tasks 1 and 2. The proof of Theorem~\ref{theorem_1} is relegated to Appendix~\ref{P1} due to space limitations.
The next result establishes the backward knowledge transfer of our CL approach:

\begin{theorem}\label{theorem_2}
Suppose loss $\mathcal{L}$ is $B$-Lipschitz and $\frac{H}{2}$-smooth. 
Then, the following results hold:

%\smallskip
(1) Let $\W^s$ and $\W^c$ be the model parameters after one update to an initial model $\W$ by using local low-rank and full-rank orthogonal-projection-based updates, respectively. 
Suppose the new task satisfies local relative orthogonality for a $\lambda_3 \in (0, 1)$, i.e.,
$\|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W^{(i)}))\|_2 = \lambda_3 \|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W^{(i)}))\|_2$ for $i \!\in\! [0, k\!-\!1]$, $\alpha \!\le\! \min \{\frac{1}{H}, \frac{\gamma \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{HBK}\}$ and $\lambda_1 \!\ge\! \max\{$ $\sqrt{1\!-\! 2\frac{2\|\boldsymbol{\bar{g}}_2(\W^{(0)})\|\!-\!\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}{\gamma^2 \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}}, \!\sqrt{1 \!-\! \frac{(1-\lambda_3^2)(2+\alpha H){\lambda^\prime}_1^2}{1+2\alpha H}} \big\}$, then we have $\mathcal{F}(\W^s) \le \mathcal{F}(\W^c)$;

%\smallskip
(2) Let $\W^{(k)}$ be the $k$-th iterate for task 2 with the $\theta$-regularized update in Regime 3. 
If $\alpha \le \frac{4 \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{HBk^{1.5}}$, then $\mathcal{L}_1(\W^{(k)}) \le  \mathcal{L}_1(\W_1) = \mathcal{L}_1(\W^{(0)})$.
\end{theorem}

The first claim in Theorem~\ref{theorem_2} indicates that updating the model using the local low-rank orthogonal-projection-based updates
achieves lower loss value than the full-rank orthogonal-projection-based updates %\textbf{(Rule \#1s)} 
when the $q$-th local model of task 1 and task 2 satisfy the local sufficient projection condition with some $\lambda_1$ and the local relative orthogonality condition in Definition~\ref{defn:LocalRelOrth} with some $\lambda_3$, hence implying {\em backward knowledge transfer.}
The second claim in Theorem~\ref{theorem_2} suggests that the local low-rank orthogonal-projection-based update
results in a better model for task 1 with respect to $\mathcal{L}_1$. 
The proof of Theorem~\ref{theorem_2} is also relegated to Appendix~\ref{P2} due to space limitations.

Next, we provide the approximation accuracy analysis and comparison of the loss functions between applying local low-rank and full-rank orthogonal-projection-based updates.

For any anchor point $s_q$, we let $\mathcal{B}_h(s_q)$ denote the neighborhood of indices near that anchor point, which is defined as $\mathcal{B}_h(s_q) \defeq \{ s^\prime \in [M] \times [N^j]: d(s_q, s^\prime) < h \}$, and we use $M(h, s_q)$ and $N(h, s_q)$ to denote the numbers of unique row and column indices in $\mathcal{B}_h(s_q)$, respectively. Also, we denote $n_q = \min(M(h, s_q), N(h, s_q))$. Then, we have the following theorem on the approximation accuracy:

\begin{figure*}[t!]
\centering
\includegraphics[width=1.0\linewidth]{Fig/new.pdf}
%\vspace{-.1in}
\caption{Ablation studies on rank and number of anchor points.
\label{ablation_study}}
%\vspace{-.25in}
\end{figure*}

\begin{table*}[t!]
  \caption{The ACC and BWT performance comparisons between LMSP (ours) and baselines.}
  \label{result_table}
  \centering
  {\footnotesize
  \begin{tabular}{lllllllll}
    \toprule
    \multirow{2}{*}{Method} & \multicolumn{2}{c}{PMNIST} & \multicolumn{2}{c}{CIFAR-100 Split}  & \multicolumn{2}{c}{5-Datasets} & \multicolumn{2}{c}{MiniImageNet} \\
    & ACC(\%) & BWT(\%) & ACC(\%) & BWT(\%) & ACC(\%) & BWT(\%) & ACC(\%) & BWT(\%) \\
    \midrule
    Multitask & 96.70 & - & 79.58 & - & 91.54 & - & 69.46 & - \\
    \midrule
    OWM & 90.71 & -1 & 50.94 & -30 & - & - & - & - \\
    EWC & 89.97 & -4 & 68.80 & -2 & 88.64 & -4 & 52.01 & -12 \\
    HAT & - & - & 72.06 & 0 & 91.32 & -1 & 59.78 & -3 \\
    A-GEM & 83.56 & -14 & 63.98 & -15 & 84.04 & -12 & 57.24 & -12 \\
    ER-Res & 87.24 & -11 & 71.73 & -6 & 88.31 & -4 & 58.94 & -7 \\
    GPM & 93.91 & -3 & 72.48 & -0.9 & 91.22 & -1 & 60.41 & -0.7 \\
    TRGP & 96.26 & -1.01 & 74.98 & -0.15 & 92.41 & -0.08 & \textbf{64.46} & -0.89 \\
    CUBER & 97.04 & -0.11 & \textbf{75.29} & 0.14 & 92.85 & -0.13 & 63.67 & 0.11 \\
    {\bf LMSP($r=25$)} & \textbf{97.48} & \textbf{0.16} & 74.21 & \textbf{0.94} & \textbf{93.78} & \textbf{0.07} & 64.2 & \textbf{1.55} \\
    \bottomrule
  \end{tabular}
  }
\end{table*}

\begin{theorem}\label{theorem_3}
Suppose loss $\mathcal{L}$ is $B$-Lipschitz and $\frac{H}{2}$-smooth. Let $\W^s$ and $\W^c$ be the model parameters after one update to an initial model $\W$ by using local low-rank and full-rank orthogonal-projection-based updates, respectively. Assume that the mapping function $T(s) = \mathcal{T}^2 = \R_j^l{\R_j^l}^\top$, which represents the Gram matrix of the original matrix, is Hölder continuous with parameters $Z, \beta > 0$. Let $\alpha \le \min \{\frac{1}{H}, \frac{\gamma \| \boldsymbol{g}_1(\W^{(0)})\|}{HBK}\}$ for some $\gamma \in (0, 1)$. 
Then, the loss discrepancy between the full-rank and local low-rank orthogonal-projection-based updates corresponding to anchor point $s_q$, i.e.,
\begin{align}
\mathcal{E}(\mathcal{F})(s_q, h) = \mathcal{F}(\W^c) - \mathcal{F}(\W^s),
\end{align}
is upper bounded as:
\begin{align}\label{theorem:upper}
\mathcal{E}(\mathcal{F})(s_q, h) \le& HZ^2 h^{2\beta}(24n_q+9)B^2 \nonumber \\
&+Z h^\beta(4\sqrt{3n_q\!+\!2})\bigg[\frac{2 + \gamma^2}{4} \|\boldsymbol{g}_1 (\W^{(0)})\|^2 \nonumber\\ 
&+ \|\boldsymbol{g}_1 (\W^{(0)})\|\|\boldsymbol{g}_2(\W^{(0)})\| + \frac{3}{2}B^2\bigg],
\end{align}
and lower bounded as:
\begin{align}\label{theorem:lower}
&\mathcal{E}(\mathcal{F})(s_q, h) \ge -HZ^2 h^{2\beta}(24n_q+9)B^2 \nonumber \\
&\hspace{.2in} +Z h^\beta(4\sqrt{3n_q+2})\bigg[-\frac{2 + \gamma^2}{4} \|\boldsymbol{g}_1 (\W^{(0)})\|^2 \nonumber\\
&\hspace{.2in} + \|\boldsymbol{g}_1 (\W^{(0)})\|\|\boldsymbol{g}_2(\W^{(0)})\| + \frac{1}{2}B^2\bigg].\!\!\!
\end{align}
\end{theorem}
Theorem~\ref{theorem_3} provides approximation accuracy bounds between the original full-rank and the local low-rank updates, which indicate that the error introduced by the local model space projection is bounded and that the bound is mostly influenced by the first squared term in both Eqs.~\eqref{theorem:upper} and~\eqref{theorem:lower}. 
Noting that this term is the squared bound for matrix completion based on \citep{lee2013local}, the LMSP loss error bound is roughly on the order of the square of the matrix completion bound due to the inner product calculation of the basis in the projection computation. 
The proof of Theorem~\ref{theorem_3} is also relegated to Appendix~\ref{P3} due to space limitations.

\section{Numerical Results}\label{Section: experiment}

In this section, we conduct experiments to verify the efficacy of our proposed LMSP approach.
We will first discuss our experiment settings, including datasets, baselines, and evaluation metrics, which are followed by experimental results.

%\smallskip
\textbf{1)~Datasets:} We evaluate the performance of our LMSP on four public
datasets for CL: (1) Permuted MNIST~\citep{lecun2010mnist}; 
(2) CIFAR-100 Split~\citep{krizhevsky2009learning};
(3) 5-Datasets~\citep{lin2022beyond,lin2022trgp};
and (4) MiniImageNet~\citep{vinyals2016matching}.
Due to space limitations, the detailed dataset information is relegated to Appendix~\ref{T5}. 

%\smallskip
\textbf{2)~Baseline Methods:} We compare our LMSP method with the following baseline methods: 
%\vspace{-.1in}
\begin{list}{\labelitemi}{\leftmargin=1.8em \itemindent=-0.1em \itemsep=-.2em}
\item[(1)] {\em EWC}~\citep{kirkpatrick2017overcoming}: EWC adopts the Fisher information matrix for weight importance evaluation;
\item[(2)]~{\em HAT}~\citep{serra2018overcoming}: HAT preserves the knowledge of an old task by learning a hard attention mask; 
\item[(3)]~{\em Orthogonal Weight Modulation (OWM)} \citep{zeng2019continual}: OWM projects the gradient of a new task to the orthogonal direction of the input subspace of an old task by learning a projector matrix; 
\item[(4)] {\em Gradient Projection Memory (GPM)} \citep{saha2021gradient}: GPM first stores old tasks' basis of the input subspace, and then uses the gradient projection orthogonal to the subspace spanned by stored basis to update the model; 
\item[(5)]~{\em TRGP}~\citep{lin2022trgp}: TRGP uses a scaled weight projection to facilitate the forward knowledge transfer from related old tasks to the new task;
\item[(6)]~{\em CUBER}~\citep{lin2022beyond}: CUBER categorizes old tasks based on the strong projection and positive correlation conditions to enable backward knowledge transfer;
\item[(7)]~{\em Averaged GEM (A-GEM)}~\citep{chaudhry2018efficient}: A-GEM stores and incorporates old tasks' data in computing gradients for the new task's learning; 
\item[(8)] {\em Experience Replay with Reservoir sample (ER-Res)}~\citep{chaudhry2019continual}: ER-Res uses a small episodic memory to store old task samples to address the forgetting problem; 
\item[(9)]~{\em Multitask}~\citep{saha2021gradient}: Multitask jointly learns all tasks once with a single network using all datasets.
\end{list}

\begin{table*}[!t]
  %\captionsetup{labelfont={color=blue},font={color=blue}}
  \caption{Training time comparison on PMNIST, CIFAR-100 Split, 5-Datasets, and MiniImageNet. Here, the
training time is normalized with respect to that of GPM. Please refer to \citep{saha2021gradient} for more specific training times.}
  \label{table:training_time}
  \centering
  %\color{blue}
  {\begin{tabular}{llllllllllll}
    \toprule
    Training time & OWM & EWC & HAT & A-GEM & ER-Res & GPM & TRGP & CUBER & \makecell{\textbf{LMSP} \\ ($r=5$)} & \makecell{\textbf{LMSP} \\ ($r=20$)} & \makecell{\textbf{LMSP} \\ ($r=25$)} \\
    \midrule
    PMNIST & - & 1.49 & 1.23 & 2.57 & 1.34 & 1 & 1.37 & 1.52 & \textbf{0.37} & 0.48 & 0.52\\
    CIFAR-100 Split & 2.41 & 1.76 & 1.62 & 3.48 & 1.49 & 1 & 1.65 & 1.86 & \textbf{0.24} & 0.41 & 0.46\\
    5-Datasets & - & 1.52 & 1.47 & 2.41 & 1.40 & 1 & 1.21 & 1.55 & \textbf{0.42} & 0.63 & 0.67\\
    MiniImageNet & - & 1.22 & 0.91 & 1.79 & 0.82 & 1 & 1.34 & 1.61 & \textbf{0.18} & 0.30 & 0.33\\
    \bottomrule
  \end{tabular}}
\end{table*}

%\smallskip
\textbf{3)~Evaluation Metrics:}
We use the following two metrics to evaluate the learning performance of the baseline models and our model: 
(1)~Accuracy (ACC), which is the final averaged accuracy over all tasks; and
(2)~Backward transfer (BWT), which is the average change in the accuracy of each old task after all $T$ tasks have been learned, relative to its accuracy immediately after it was learned:
\begin{align*}
& \mathrm{ACC} = \frac{1}{T}\sum^T_{i=1} A_{T, i}, \\
& \mathrm{BWT} = \frac{1}{T-1}\sum^{T-1}_{i=1} (A_{T, i} - A_{i, i}),    
\end{align*}
where $A_{i, j}$ denotes the testing accuracy of task $j$ upon the completion of learning task $i$.
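As a small worked example with purely hypothetical accuracies (not experimental results), let $T = 3$ with $A_{1,1} = 92$, $A_{2,2} = 84$, and final accuracies $A_{3,1} = 90$, $A_{3,2} = 85$, and $A_{3,3} = 80$ (all in \%). Then,
\begin{align*}
\mathrm{ACC} &= \tfrac{1}{3}(90 + 85 + 80) = 85, \\
\mathrm{BWT} &= \tfrac{1}{2}[(90 - 92) + (85 - 84)] = -0.5.
\end{align*}
A negative BWT thus indicates net forgetting of earlier tasks, whereas a positive BWT indicates that learning later tasks improved the performance on earlier ones.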


%\smallskip
{\bf 4)~Experimental Results:}
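We can see from Table~\ref{result_table} that our LMSP method achieves the best BWT on all four datasets and the best ACC on PMNIST and 5-Datasets, while remaining competitive in ACC on CIFAR-100 Split and MiniImageNet. 
It is worth noting that the BWT performance of our method is consistently better than that of CUBER. 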
This improvement stems from our approach of dividing old tasks into multiple local tasks, making it easier to identify highly correlated local tasks for the new task.
%The reason is that our method could significantly extend the old task candidates, hence much easier to find some strongly correlated old tasks for the new task.
%\subsection{Ablation Study}
To understand the efficacy of the proposed techniques, we further conduct ablation studies.
We show the effects of different rank values and numbers of anchor points for our approach in Fig.~\ref{ablation_study}.
Due to space limitations, we relegate the ablation study results with different kernel types to Appendix~\ref{T4}.

%\smallskip
\textit{4-1)~Effect of Low Rank}: Fig.~\ref{ablation_study}(a) and (c) show the results of our method with different rank values $r$. 
We can see that, as expected, the model's performance improves as the rank increases. 
In general, a higher rank value implies less information loss during the basis construction. 
Further, as the rank value becomes sufficiently high, the performance improvement becomes insignificant since most of the information has already been included.

%\smallskip
\textit{4-2)~Effect of Anchor Point Number}: Fig.~\ref{ablation_study}(b) and (d) illustrate the performance of our LMSP method with different numbers of anchor points. 
We can see that more anchor points lead to better performance, since more candidate old tasks are generated, making it easier to find old tasks that are more correlated with the new task.
However, as the number of anchor points increases, the computation cost also increases correspondingly, which implies a trade-off between performance and cost.

\textit{4-3)~Results of Training Time}: 
Table~\ref{table:training_time} summarizes the wall-clock training times of our LMSP algorithm and several baselines, normalized with respect to the wall-clock training time of GPM (additional wall-clock training time results can also be found in \citep{saha2021gradient}). Here, we set the rank $r$ of each local model to $5$, $20$, and $25$. We can see that our LMSP ($r=5$) method with \textit{only one anchor point} already reduces the total wall-clock training time of CUBER by 86\% on average. Moreover, because our LMSP approach admits a distributed implementation that runs different local models in parallel, the total wall-clock training time with $m$ anchor points is similar to that of the single-anchor-point case above. 

It is worth mentioning that LMSP achieves comparable training time even with $r=25$. Furthermore, the results in Fig.~\ref{ablation_study} and Table \ref{result_table} indicate that LMSP with $r=5$ performs on par with TRGP and outperforms GPM in terms of average ACC and BWT, suggesting that a lower rank does not significantly compromise the performance of our method. Our local low-rank-based methods also demonstrate improved efficiency, particularly when compared to CUBER, which relies on a full-rank setting without leveraging local low-rank strategies. 
In conclusion, the results in Table~\ref{table:training_time} demonstrate the effectiveness of our LMSP approach in reducing computation cost compared to the original layer-wise full-rank orthogonal-projection-based approach.

Due to space limitations, we relegate the results of forward knowledge transfer for our LMSP approach to Appendix~\ref{T6}. 

\section{Conclusion}\label{Section: conclusion}
In this paper, we proposed a new efficient local low-rank orthogonal-projection-based continual learning strategy based on local model space projection (LMSP), which not only reduces the complexity of basis computation but also enables forward and backward knowledge transfers.
We conducted a theoretical analysis to show that the new task's performance could benefit from the local old tasks more than just using the global old task under certain circumstances. 
We also provided a training loss error analysis and showed that the approximation accuracy of  LMSP compared to the original full-rank orthogonal-projection-based approach can be both upper and lower bounded.
Our extensive experiments on public datasets demonstrated the efficacy of our approach. 
Future work includes applying our efficient CL method to popular deep learning architectures such as transformers and large language models (LLMs), and extending our approach to more general CL settings.

% References
\bibliography{uai2025-template}

\newpage

\onecolumn

\title{Appendix}
\maketitle

\appendix
\section{Proof of Theorem \ref{theorem_1}}\label{P1}

\begin{proof}
For a $\frac{H}{2}$-smooth loss function $\mathcal{L}$, it can be easily shown that $\mathcal{F}$ is $H$-smooth.
(1) For any $k \in [0, K]$, we have:
\begin{align}\label{eq:2}
    \mathcal{F}(\W^{(k+1)}) & \le \mathcal{F}(\W^{(k)}) + \nabla \mathcal{F}(\W^{(k)})^\top(\W^{(k+1)} - \W^{(k)}) + \frac{H}{2}\|\W^{(k+1)} - \W^{(k)}\|^2 \nonumber \\
    & =\mathcal{F}(\W^{(k)}) + (\boldsymbol{g}_1(\W^{(k)}) + \boldsymbol{g}_2(\W^{(k)}))^\top(-\alpha \boldsymbol{\bar{g}}_2(\W^{(k)})) + \frac{\alpha^2 H}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 \nonumber \\
    & =\mathcal{F}(\W^{(k)}) - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 - \alpha \langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle, 
\end{align}
since:
\begin{align}
    \langle \boldsymbol{g}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle = \langle \mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_1(\W^{(k)})), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle + \langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle,
\end{align}
\begin{align}
    \langle \boldsymbol{g}_2(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle = \langle \mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W^{(k)})), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle + \langle \boldsymbol{\bar{g}}_2(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle,
\end{align}
and:
\begin{align}
    \langle \mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_1(\W^{(k)})), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle = 0,
\end{align}
\begin{align}
    \langle \mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W^{(k)})), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle = 0.
\end{align}
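As a brief justification of the two orthogonality relations above, the following sketch assumes that $\mathrm{Proj}_{K_h^{(q)}D_1}$ is an orthogonal projection and writes it as a symmetric idempotent matrix $\boldsymbol{P}$ (a shorthand introduced here only for illustration), so that $\boldsymbol{\bar{g}}_2 = (\boldsymbol{I} - \boldsymbol{P})\boldsymbol{g}_2$. Omitting the argument $\W^{(k)}$, for $i \in \{1, 2\}$ we have
\begin{align*}
    \langle \boldsymbol{P}\boldsymbol{g}_i, (\boldsymbol{I} - \boldsymbol{P})\boldsymbol{g}_2 \rangle = \boldsymbol{g}_i^\top \boldsymbol{P}^\top (\boldsymbol{I} - \boldsymbol{P}) \boldsymbol{g}_2 = \boldsymbol{g}_i^\top (\boldsymbol{P} - \boldsymbol{P}^2) \boldsymbol{g}_2 = 0,
\end{align*}
since $\boldsymbol{P}^\top = \boldsymbol{P}$ and $\boldsymbol{P}^2 = \boldsymbol{P}$.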
For the term $\langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle$, it follows that:
\begin{align}\label{eq:1}
    &\langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle \nonumber \\
    = & \langle \boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)}) + \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle \nonumber \\
    = & \langle \boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle \nonumber \\
    = & \langle \boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) - \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle.
\end{align}
Considering
\begin{align}
    & 2\langle \boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle + \|\boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 \nonumber \\
    = & \|\boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)}) +  \boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 \ge 0,
\end{align}
we have:
\begin{align}\label{ineq:1}
    \langle \boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle \ge -\frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 -\frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2,
\end{align}
and similarly:
\begin{align}\label{ineq:2}
    \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) - \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle \ge -\frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)}) - \boldsymbol{\bar{g}}_2(\W^{(0)})\|^2 -\frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2.
\end{align}
Combining Eq.~\eqref{eq:1}, Eq.~\eqref{ineq:1} and Eq.~\eqref{ineq:2} gives a lower bound on $\langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle$, i.e.,
\begin{align}\label{eq:inner}
    & \langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle \nonumber \\
    \ge & -\frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 -\frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 \nonumber \\
    & -\frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)}) - \boldsymbol{\bar{g}}_2(\W^{(0)})\|^2 -\frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle \nonumber \\
    \ge & -\frac{H^2(1-\lambda_1^2)}{8}\|\W^{(k)} - \W^{(0)}\|^2 - \frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 \nonumber \\
    & -\frac{H^2(1-\lambda_1^2)}{8}\|\W^{(k)} - \W^{(0)}\|^2 - \frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle \nonumber \\
    \ge & -\frac{H^2(1-\lambda_1^2)}{4}\|\W^{(k)} - \W^{(0)}\|^2 - \frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 - \frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle,
\end{align}
where the second inequality is true due to the smoothness of the loss function and:
\begin{align}
    \|\boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 & = \|\boldsymbol{g}_1(\W^{(k)}) - \boldsymbol{g}_1(\W^{(0)})\|^2 - \|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_1(\W^{(k)}) - \boldsymbol{g}_1(\W^{(0)}))\|^2 \nonumber \\
    & \le (1 - \lambda_1^2)\|\boldsymbol{g}_1(\W^{(k)}) - \boldsymbol{g}_1(\W^{(0)})\|^2,
\end{align}
as well as
\begin{align}
    \|\boldsymbol{\bar{g}}_2(\W^{(k)}) - \boldsymbol{\bar{g}}_2(\W^{(0)})\|^2 \le (1 - \lambda_1^2)\|\boldsymbol{g}_2(\W^{(k)}) - \boldsymbol{g}_2(\W^{(0)})\|^2.
\end{align}
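For completeness, the coefficient $\frac{H^2(1-\lambda_1^2)}{8}$ in Eq.~\eqref{eq:inner} can be traced as follows. This is a sketch that assumes each gradient $\boldsymbol{g}_i$ is $\frac{H}{2}$-Lipschitz, which is what the $\frac{H}{2}$-smoothness of $\mathcal{L}$ provides:
\begin{align*}
    \frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(k)}) - \boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 & \le \frac{1-\lambda_1^2}{2}\|\boldsymbol{g}_1(\W^{(k)}) - \boldsymbol{g}_1(\W^{(0)})\|^2 \\
    & \le \frac{1-\lambda_1^2}{2} \cdot \frac{H^2}{4}\|\W^{(k)} - \W^{(0)}\|^2 = \frac{H^2(1-\lambda_1^2)}{8}\|\W^{(k)} - \W^{(0)}\|^2,
\end{align*}
and the same chain applies to $\boldsymbol{\bar{g}}_2$.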
Based on the local low-rank orthogonal-projection-based update, it can be seen that:
\begin{align}\label{eq:diff}
    \W^{(k)} = \W^{(0)} - \alpha \sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)}).
\end{align}
Therefore, continuing with Eq.~\eqref{eq:2}, we have:
\begin{align}
    & \mathcal{F}(\W^{(k+1)}) \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 - \alpha \langle \boldsymbol{\bar{g}}_1(\W^{(k)}), \boldsymbol{\bar{g}}_2(\W^{(k)}) \rangle \nonumber \\
    \le &  \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + \frac{\alpha^3 H^2(1-\lambda_1^2)}{4}\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 + \frac{\alpha}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 \nonumber \\
    & - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\|,
\end{align}
where the last term is based on the definition of projection.
Since
\begin{align}
    \alpha \le \frac{\gamma \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{HBK} \le \frac{\gamma \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{H\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|},
\end{align}
we have
\begin{align}
    & \frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \frac{\alpha^2 H^2(1-\lambda_1^2)}{4}\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    \le & \frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \frac{\gamma^2(1-\lambda_1^2)\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2}{ 4H^2\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2}H^2\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    = & \frac{2+\gamma^2(1-\lambda_1^2)}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2.
\end{align}
Therefore, we can obtain that:
\begin{align}\label{eq:3}
    & \mathcal{F}(\W^{(k+1)}) \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 \nonumber \\
    \le & \mathcal{F}(\W^{(k)}),
\end{align}
where the second inequality is true because:
\begin{align}\label{eq:lambda_1}
    &\lambda_1 \ge \sqrt{1- 2\frac{2\|\boldsymbol{\bar{g}}_2(\W^{(0)})\|-\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}{\gamma^2 \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}} \nonumber \\
    \implies &  \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \le 0.
\end{align}
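The implication in Eq.~\eqref{eq:lambda_1} can be verified by direct rearrangement. As a sketch, write $a = \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|$ and $b = \|\boldsymbol{\bar{g}}_2(\W^{(0)})\|$ (shorthand used only here), and assume $a > 0$, $\gamma > 0$, and $\alpha > 0$. Squaring the condition on $\lambda_1$ gives
\begin{align*}
    \lambda_1^2 \ge 1 - \frac{2(2b - a)}{\gamma^2 a} & \iff \gamma^2(1-\lambda_1^2)a \le 4b - 2a \\
    & \iff [2+\gamma^2(1-\lambda_1^2)]a \le 4b \\
    & \iff \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4}a^2 \le \alpha a b,
\end{align*}
which is the inequality used to drop the last two terms.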
This sufficient decrease of the objective function value indicates that the optimal $\mathcal{F}(\W^\star)$ can be obtained for convex loss functions.

(2) For a non-convex loss function $\mathcal{L}$, since $\nabla \mathcal{F}(\W^{(k)}) = \boldsymbol{g}_1(\W^{(k)}) +\boldsymbol{g}_2(\W^{(k)})$, it follows from Eq.~\eqref{eq:3} that:
\begin{align}
    & \mathcal{F}(\W^{(k+1)}) \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \nonumber \\
    & - \frac{\alpha}{2}[\|\nabla \mathcal{F}(\W^{(k)})\|^2 -  \|\boldsymbol{g}_1(\W^{(k)})\|^2 -  \|\boldsymbol{g}_2(\W^{(k)})\|^2 - 2\langle \boldsymbol{g}_1(\W^{(k)}), \boldsymbol{g}_2(\W^{(k)}) \rangle] \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \nonumber \\
    & - \frac{\alpha}{2}[\|\nabla \mathcal{F}(\W^{(k)})\|^2 -  2\|\boldsymbol{g}_1(\W^{(k)})\|^2 -  2\|\boldsymbol{g}_2(\W^{(k)})\|^2].
\end{align}
From Eq.~\eqref{eq:diff} we have
\begin{align}
    \|\boldsymbol{g}_1(\W^{(k)})\|^2 & = \|\boldsymbol{g}_1(\W^{(k)}) - \boldsymbol{g}_1(\W^{(0)}) + \boldsymbol{g}_1(\W^{(0)})\|^2 \le 2\|\boldsymbol{g}_1(\W^{(k)}) - \boldsymbol{g}_1(\W^{(0)})\|^2 + 2\|\boldsymbol{g}_1(\W^{(0)})\|^2 \nonumber \\
    & \le \frac{\alpha^2 H^2}{2}\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 + 2\|\boldsymbol{g}_1(\W^{(0)})\|^2 \nonumber \\
    & \le \frac{\gamma^2}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + 2\|\boldsymbol{g}_1(\W^{(0)})\|^2,
\end{align}
and
\begin{align}
    \|\boldsymbol{g}_2(\W^{(k)})\|^2 & = \|\boldsymbol{g}_2(\W^{(k)}) - \boldsymbol{g}_2(\W^{(0)}) + \boldsymbol{g}_2(\W^{(0)})\|^2 \le 2\|\boldsymbol{g}_2(\W^{(k)}) - \boldsymbol{g}_2(\W^{(0)})\|^2 + 2\|\boldsymbol{g}_2(\W^{(0)})\|^2 \nonumber \\
    & \le \frac{\alpha^2 H^2}{2}\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 + 2\|\boldsymbol{g}_2(\W^{(0)})\|^2 \nonumber \\
    & \le \frac{\gamma^2}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + 2\|\boldsymbol{g}_2(\W^{(0)})\|^2,
\end{align}
where the last inequality holds as
\begin{align}
    \alpha \le \frac{\gamma \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{HBK} \le \frac{\gamma \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{H\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|}.
\end{align}
Therefore
\begin{align}
    & \mathcal{F}(\W^{(k+1)}) \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \nonumber \\
    & - \frac{\alpha}{2}\|\nabla \mathcal{F}(\W^{(k)})\|^2 + 2\alpha\|\boldsymbol{g}_1(\W^{(0)})\|^2 +  2\alpha\|\boldsymbol{g}_2(\W^{(0)})\|^2+\alpha\gamma^2\| \boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 \nonumber \\
    \le & \mathcal{F}(\W^{(k)}) - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + \frac{\alpha[2+\gamma^2(5-\lambda_1^2)]}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \nonumber \\
    & - \frac{\alpha}{2}\|\nabla \mathcal{F}(\W^{(k)})\|^2 + 2\alpha\|\boldsymbol{g}_1(\W^{(0)})\|^2 +  2\alpha\|\boldsymbol{g}_2(\W^{(0)})\|^2.
\end{align}
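The coefficient in the last step comes from merging the two terms in $\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2$; written out as a short check:
\begin{align*}
    \frac{\alpha[2+\gamma^2(1-\lambda_1^2)]}{4} + \alpha\gamma^2 = \frac{\alpha[2+\gamma^2(1-\lambda_1^2) + 4\gamma^2]}{4} = \frac{\alpha[2+\gamma^2(5-\lambda_1^2)]}{4}.
\end{align*}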
Thus,
\begin{align}
    &\min_k\|\nabla \mathcal{F}(\W^{(k)})\|^2 \nonumber \\
    \le & \frac{1}{K}\sum_{k=0}^{K-1}\|\nabla \mathcal{F}(\W^{(k)})\|^2 \nonumber \\
    \le & \frac{2}{\alpha K}\sum_{k=0}^{K-1}[\mathcal{F}(\W^{(k)}) - \mathcal{F}(\W^{(k+1)})] + \frac{[2+\gamma^2(5-\lambda_1^2)]}{2(K-1)}\sum_{k=1}^{K-1}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 - 2\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| \nonumber \\
    & -\frac{1-\alpha H}{K}\sum_{k=0}^{K-1}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2 + 4\|\boldsymbol{g}_1(\W^{(0)})\|^2 +  4\|\boldsymbol{g}_2(\W^{(0)})\|^2 \nonumber \\
    \le & \frac{2}{\alpha K}[\mathcal{F}(\W^{(0)}) - \mathcal{F}(\W^\star)] + \frac{[2+\gamma^2(5-\lambda_1^2)]}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + 4\|\boldsymbol{g}_1(\W^{(0)})\|^2 +  4\|\boldsymbol{g}_2(\W^{(0)})\|^2,
\end{align}
where the last inequality holds due to $\mathcal{F}(\W^\star) \le \mathcal{F}(\W^{(K)})$.
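To make the last step explicit, the following sketch additionally assumes $\alpha H \le 1$, so that the term in $\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2$ is nonpositive. The first term telescopes,
\begin{align*}
    \sum_{k=0}^{K-1}[\mathcal{F}(\W^{(k)}) - \mathcal{F}(\W^{(k+1)})] = \mathcal{F}(\W^{(0)}) - \mathcal{F}(\W^{(K)}) \le \mathcal{F}(\W^{(0)}) - \mathcal{F}(\W^\star),
\end{align*}
while the terms $-2\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\|$ and $-\frac{1-\alpha H}{K}\sum_{k=0}^{K-1}\|\boldsymbol{\bar{g}}_2(\W^{(k)})\|^2$ are nonpositive and can be dropped from the upper bound.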
\end{proof}

\section{Proof of Theorem \ref{theorem_2}}\label{P2}
\begin{proof}
(1) For local low-rank orthogonal-projection-based update, we have
\begin{align}\label{local_update_eq}
    \W^s = \W - \alpha[\boldsymbol{g}_2(\W) - \mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W))] = \W - \alpha\boldsymbol{\bar{g}}_2(\W).
\end{align}
For full-rank orthogonal-projection-based update, we have
\begin{align}\label{global_update_eq}
    \W^c = \W - \alpha[\boldsymbol{g}_2(\W) - \mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))] = \W - \alpha\boldsymbol{\ddot{g}}_2(\W).
\end{align}
Based on Eq.~\eqref{eq:2} and the smoothness of the objective function, we have an upper bound on $\mathcal{F}(\W^s)$:
\begin{align}\label{eq:upper}
    \mathcal{F}(\W^s) \le \mathcal{F}(\W) - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W)\|^2 - \alpha \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle,
\end{align}
and a lower bound on $\mathcal{F}(\W^c)$:
\begin{align}\label{eq:lower}
    \mathcal{F}(\W^c) \ge \mathcal{F}(\W) + \nabla \mathcal{F}(\W)^\top(\W^c - \W) - \frac{H}{2}\|\W^c - \W\|^2.
\end{align}
Combining Eq.~\eqref{eq:upper} and Eq.~\eqref{eq:lower}, we have
\begin{align}\label{eq:t2}
    & \mathcal{F}(\W^s) \nonumber \\
    \le & \mathcal{F}(\W^c) - \nabla \mathcal{F}(\W)^\top(\W^c - \W) + \frac{H}{2}\|\W^c - \W\|^2 - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W)\|^2 - \alpha \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle \nonumber \\
    = & \mathcal{F}(\W^c) - \langle \boldsymbol{g}_1(\W) + \boldsymbol{g}_2(\W), -\alpha \boldsymbol{\ddot{g}}_2(\W) \rangle + \frac{\alpha^2 H}{2}\|\boldsymbol{\ddot{g}}_2(\W)\|^2 - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W)\|^2 \nonumber \\
    & - \alpha \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle \nonumber \\
    = & \mathcal{F}(\W^c) + \alpha \langle \boldsymbol{g}_1(\W), \boldsymbol{\ddot{g}}_2(\W) \rangle + \alpha \langle \boldsymbol{g}_2(\W), \boldsymbol{\ddot{g}}_2(\W) \rangle +  \frac{\alpha^2 H}{2}\|\boldsymbol{\ddot{g}}_2(\W)\|^2 - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W)\|^2 \nonumber \\
    & - \alpha \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle \nonumber \\
    = & \mathcal{F}(\W^c) + [\alpha + \frac{\alpha^2 H}{2}]\|\boldsymbol{\ddot{g}}_2(\W)\|^2 - [\alpha - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W)\|^2 - \alpha \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle,
\end{align}
where the last equality is true because
\begin{align}
    \langle \boldsymbol{g}_2(\W), \boldsymbol{\ddot{g}}_2(\W) \rangle = \langle \mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W)), \boldsymbol{\ddot{g}}_2(\W) \rangle + \langle \boldsymbol{\ddot{g}}_2(\W), \boldsymbol{\ddot{g}}_2(\W) \rangle,
\end{align}
and both $\boldsymbol{g}_1(\W)$ and $\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))$ are orthogonal to $\boldsymbol{\ddot{g}}_2(\W)$.
Based on Eq.~\eqref{eq:inner}, the inner product in the last term satisfies:
\begin{align}\label{eq:inner2}
    & \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle \nonumber \\
    \ge & -\frac{H^2(1-\lambda_1^2)}{4}\|\W - \W^{(0)}\|^2 - \frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W)\|^2 - \frac{1}{2}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(0)}) \rangle.
\end{align}
Suppose that $\W = \W^{(n)}$ is the model at the $n$-th iteration, where $n \le K$. For the local low-rank orthogonal-projection-based update,
\begin{align}\label{eq:adiff1}
    \|\W^{(n)} - \W^{(0)}\|^2 & = \alpha^2\|\sum_{i=0}^{n-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    & \le \frac{\gamma^2\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2}{H^2B^2K^2}n\sum_{i=0}^{n-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    & \le \frac{\gamma^2 n^2\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2}{H^2K^2} \nonumber \\
    & \le \frac{\gamma^2\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2}{H^2},
\end{align}
and similarly for full-rank orthogonal-projection-based update, we also have
\begin{align}\label{eq:adiff2}
    \|\W^{(n)} - \W^{(0)}\|^2 \le \frac{\gamma^2\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2}{H^2}.
\end{align}
Therefore, continuing with Eq.~\eqref{eq:inner2}, we obtain:
\begin{align}
     & \langle \boldsymbol{\bar{g}}_1(\W), \boldsymbol{\bar{g}}_2(\W) \rangle \nonumber \\
     \ge & -\frac{2+\gamma^2(1-\lambda_1^2)}{4}\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|^2 + \|\boldsymbol{\bar{g}}_1(\W^{(0)})\|\|\boldsymbol{\bar{g}}_2(\W^{(0)})\| - \frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W)\|^2 \nonumber \\
     \ge & - \frac{1}{2}\|\boldsymbol{\bar{g}}_2(\W)\|^2,
\end{align}
where the last inequality holds due to Eq.~\eqref{eq:lambda_1}.
Continuing with Eq.~\eqref{eq:t2}, we get:
\begin{align}\label{eq:before_proj}
    \mathcal{F}(\W^s) \le \mathcal{F}(\W^c) + [\alpha + \frac{\alpha^2 H}{2}]\|\boldsymbol{\ddot{g}}_2(\W)\|^2 - [\frac{\alpha}{2} - \frac{\alpha^2 H}{2}]\|\boldsymbol{\bar{g}}_2(\W)\|^2.
\end{align}
By assumption, we have
\begin{align}
    \|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W))\|_2 = \lambda_3 \|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\|_2 \le \|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\|_2,
\end{align}
thus
\begin{align}\label{eq:proj}
    \|\boldsymbol{\bar{g}}_2(\W)\|^2 & = \|\boldsymbol{\ddot{g}}_2(\W)\|^2 + \|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\|^2 - \|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W))\|^2 \nonumber \\
    & = \|\boldsymbol{\ddot{g}}_2(\W)\|^2 + (1-\lambda_3^2)\|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\|^2.
\end{align}
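The first equality in Eq.~\eqref{eq:proj} can be checked by decomposing $\|\boldsymbol{g}_2(\W)\|^2$ in two ways. As a sketch, assuming both projections are orthogonal projections, the Pythagorean identity gives
\begin{align*}
    \|\boldsymbol{g}_2(\W)\|^2 = \|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\|^2 + \|\boldsymbol{\ddot{g}}_2(\W)\|^2 = \|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W))\|^2 + \|\boldsymbol{\bar{g}}_2(\W)\|^2,
\end{align*}
and solving the second equality for $\|\boldsymbol{\bar{g}}_2(\W)\|^2$ yields the stated expression.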
Using Eq.~\eqref{eq:proj} to substitute for $\|\boldsymbol{\ddot{g}}_2(\W)\|^2$ in Eq.~\eqref{eq:before_proj}, we have
\begin{align}
    \mathcal{F}(\W^s) &\le \mathcal{F}(\W^c) + [(\alpha + \frac{\alpha^2 H}{2}) - (\frac{\alpha}{2} - \frac{\alpha^2 H}{2})]\|\boldsymbol{\bar{g}}_2(\W)\|^2 - (1-\lambda_3^2)[\alpha + \frac{\alpha^2 H}{2}]\|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\|^2 \nonumber \\
    & \le \mathcal{F}(\W^c) + [(\frac{\alpha}{2} + \alpha^2 H)(1-\lambda_1^2)]\|\boldsymbol{g}_2(\W)\|^2 - (1-\lambda_3^2)[\alpha + \frac{\alpha^2 H}{2}]{\lambda^\prime_1}^2\|\boldsymbol{g}_2(\W)\|^2,
\end{align}
%where the last inequality holds with definition \ref{defn:SuffProj} that $\|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\| \ge \lambda^\prime_1\|\boldsymbol{g}_2(\W)\|$ and
where the last inequality holds by the global sufficient projection definition (Definition 1 in \citep{lin2022beyond}), i.e., $\|\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))\| \ge \lambda^\prime_1\|\boldsymbol{g}_2(\W)\|$, and
\begin{align}
    \|\boldsymbol{g}_2(\W)\|^2 & = \|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W)) + \boldsymbol{\bar{g}}_2(\W)\|^2 \nonumber \\
    & = \|\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W))\|^2 + \|\boldsymbol{\bar{g}}_2(\W)\|^2 \nonumber \\
    & \ge \lambda_1^2\|\boldsymbol{g}_2(\W)\|^2 + \|\boldsymbol{\bar{g}}_2(\W)\|^2.
\end{align}
Considering
\begin{align}
    & \lambda_1 \ge \sqrt{1 - \frac{(1-\lambda_3^2)(2+\alpha H){\lambda^\prime_1}^2}{1+2\alpha H}} \nonumber \\
    \implies & \alpha(1-\lambda_1^2)(1+2\alpha H) \le \alpha(1-\lambda_3^2)(2+\alpha H){\lambda^\prime_1}^2,
\end{align}
we get $\mathcal{F}(\W^s) \le \mathcal{F}(\W^c)$.
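The condition on $\lambda_1$ above is what makes the net coefficient of $\|\boldsymbol{g}_2(\W)\|^2$ in the preceding bound nonpositive. As a short check, assuming $\alpha > 0$:
\begin{align*}
    (\frac{\alpha}{2} + \alpha^2 H)(1-\lambda_1^2) \le (1-\lambda_3^2)(\alpha + \frac{\alpha^2 H}{2}){\lambda^\prime_1}^2 & \iff \alpha(1+2\alpha H)(1-\lambda_1^2) \le \alpha(1-\lambda_3^2)(2+\alpha H){\lambda^\prime_1}^2 \\
    & \iff 1-\lambda_1^2 \le \frac{(1-\lambda_3^2)(2+\alpha H){\lambda^\prime_1}^2}{1+2\alpha H},
\end{align*}
where the first equivalence multiplies both sides by $2$ and the second divides by $\alpha(1+2\alpha H) > 0$.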

(2) Based on the smoothness of the loss function, we have
\begin{align}
    \mathcal{L}_1(\W^{(k)}) & \le \mathcal{L}_1(\W^{(0)}) + \langle \boldsymbol{g}_1(\W^{(0)}), \W^{(k)} - \W^{(0)} \rangle + \frac{H}{4}\|\W^{(k)} - \W^{(0)}\|^2 \nonumber \\
    & = \mathcal{L}_1(\W^{(0)}) + \langle \boldsymbol{g}_1(\W^{(0)}), -\alpha \sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)}) \rangle + \frac{\alpha^2 H}{4}\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    & = \mathcal{L}_1(\W^{(0)}) - \alpha\sum_{i=0}^{k-1}\langle \boldsymbol{\bar{g}}_1(\W^{(0)}), \boldsymbol{\bar{g}}_2(\W^{(i)}) \rangle + \frac{\alpha^2 H}{4}\|\sum_{i=0}^{k-1}\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    & \le  \mathcal{L}_1(\W^{(0)}) - \alpha\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|[\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|] + \frac{\alpha^2 H k}{4}\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2.
\end{align}
Since $\alpha \le \frac{4 \| \boldsymbol{\bar{g}}_1(\W^{(0)})\|}{HBk^{1.5}}$, we have
\begin{align}
    \frac{\alpha H k}{4}\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 & \le \frac{\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|}{B\sqrt{k}}\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2 \nonumber \\
    & \le \frac{\|\boldsymbol{\bar{g}}_1(\W^{(0)})\|(\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2)}{\sqrt{\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2}} \nonumber \\
    & \le \|\boldsymbol{\bar{g}}_1(\W^{(0)})\| \sqrt{\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|^2} \nonumber \\
    & \le \|\boldsymbol{\bar{g}}_1(\W^{(0)})\| [\sum_{i=0}^{k-1}\|\boldsymbol{\bar{g}}_2(\W^{(i)})\|].
\end{align}
Therefore, $\mathcal{L}_1(\W^{(k)}) \le \mathcal{L}_1(\W^{(0)})$.
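For completeness, the last inequality in the chain above and the final conclusion can be spelled out as follows, writing $a_i = \|\boldsymbol{\bar{g}}_2(\W^{(i)})\| \ge 0$ (shorthand used only here). Since all cross terms are nonnegative,
\begin{align*}
    \Big(\sum_{i=0}^{k-1} a_i\Big)^2 = \sum_{i=0}^{k-1} a_i^2 + 2\sum_{0 \le i < j \le k-1} a_i a_j \ge \sum_{i=0}^{k-1} a_i^2,
\end{align*}
so that $\sqrt{\sum_{i=0}^{k-1} a_i^2} \le \sum_{i=0}^{k-1} a_i$. Multiplying the chain by $\alpha > 0$ then gives
\begin{align*}
    \frac{\alpha^2 H k}{4}\sum_{i=0}^{k-1} a_i^2 \le \alpha \|\boldsymbol{\bar{g}}_1(\W^{(0)})\| \sum_{i=0}^{k-1} a_i,
\end{align*}
which means that the last two terms in the upper bound on $\mathcal{L}_1(\W^{(k)})$ sum to at most zero.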
\end{proof}


\section{Proof of Theorem \ref{theorem_3}}\label{P3}
\begin{proof}
Suppose that the local low-rank and full-rank updates are defined as in Eq.~\eqref{local_update_eq} and Eq.~\eqref{global_update_eq}, respectively. For a $\frac{H}{2}$-smooth loss function $\mathcal{L}$, $\mathcal{F}$ is $H$-smooth, so we have the following upper bound on $\mathcal{F}(\W^c)$:
\begin{align}\label{error_main}
    \mathcal{F}(\W^c) \le \mathcal{F}(\W^s) + \nabla \mathcal{F}(\W)^\top(\W^c - \W^s) + \frac{H}{2}\|\W^c - \W^s\|^2.
\end{align}
As the projection consists of the basis and scaling matrices, we have:
\begin{align}\label{transform_local}
\W^c - \W^s = \mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W)) - \mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W)) = (\S_{j} \Q_{j} {\S_{j}}^\top - \S^{(q)}_{j} \Q^{(q)}_{j} {\S^{(q)}_{j}}^\top) \boldsymbol{g}_2(\W),
\end{align}
where $\mathrm{Proj}_{D_1}(\boldsymbol{g}_2(\W))$ denotes the projection onto the input global model space of old task $j$, for which $\S_{j}$ is the basis, and $\mathrm{Proj}_{K_h^{(q)}D_1}(\boldsymbol{g}_2(\W))$ denotes the projection onto the input local model space for anchor point $q$ of old task $j$, for which $\S^{(q)}_{j}$ is the basis. $\Q_{j}$ and $\Q^{(q)}_{j}$ are the squared scaling matrices corresponding to the bases $\S_{j}$ and $\S^{(q)}_{j}$, respectively.

Without loss of generality, for any anchor point $s_q$, we denote by $\mathcal{B}_h(s_q)$ the neighborhood of indices near that anchor point, i.e., $\mathcal{B}_h(s_q) \defeq \{ s^\prime \in [M] \times [N]: d(s_q, s^\prime) < h \}$, and we use $M(h, s_q)$ and $N(h, s_q)$ to denote the numbers of unique row and column indices in $\mathcal{B}_h(s_q)$, respectively. Also, we denote $n_q = \min(M(h, s_q), N(h, s_q))$.

Based on the loss function in Eq.~\eqref{eq:obj}, we denote by $\mathcal{T}(s) = \R_j^l$ the mapping function given by the original matrix, where $s = (a, b) \in [M] \times [N_j]$. We can then describe it locally with the estimate $\mathcal{\hat{T}}(s) = \mathcal{\hat{T}}(s_q) = \A^{(q)}{\B^{(q)}}^{\top}$ for each anchor point $s_q$. Following Proposition 1 of \citep{lee2013local}, if $|\Omega \cap \mathcal{B}_h(s_q)| \le C\mu^2r^\prime n_q\log^6n_q$, then with probability greater than $1-n_q^{-3}$, the total squared error within a neighborhood of $s_q$ is bounded as follows:
\begin{align}
\mathcal{E}(\mathcal{T})(q, h) = \|K_h^{(q)} \odot(\mathcal{T}(s) - \mathcal{\hat{T}}(s))\|_F \le Z^\prime h^\beta(4\sqrt{\frac{n_q(2+p)}{p}+2}) = Z^\prime h^\beta(4\sqrt{3n_q+2}),
\end{align}
Here, $\mathcal{T}$ is Hölder continuous, i.e., $\|\mathcal{T}(x) - \mathcal{T}(x^\prime)\|_F \le Z^\prime d^\beta(x, x^\prime)$ with parameters $Z^\prime, \beta > 0$. $\mathcal{T}(s)$ is a rank-$r^\prime$ matrix that satisfies the strong incoherence property with parameter $\mu$ described by \citep{candes2010matrix}. $C$ is a constant. The kernel function $K_h$ is a uniform kernel based on a product distance function. $p = \frac{|\Omega \cap \mathcal{B}_h(s_q)|}{|\mathcal{B}_h(s_q)|}$ is the density of observed samples. Since the observed set of indices of the matrix is full in our case, we have $|\Omega \cap \mathcal{B}_h(s_q)| = |\mathcal{B}_h(s_q)|$ and $p = 1$.

Let $T = \mathcal{T}^2 = \R_j^l{\R_j^l}^\top$ be the Gram matrix of the original matrix. It is easy to prove that this function is still Hölder continuous, i.e., $\|T(x) - T(x^\prime)\|_F \le Z d^\beta(x, x^\prime)$ with a new parameter $Z > 0$ and the same parameter $\beta > 0$. Thus, the above inequality still holds: if $|\Omega \cap \mathcal{B}_h(s_q)| \le C\mu^2r n_q\log^6n_q$, then with probability greater than $1-n_q^{-3}$,
\begin{align}
\mathcal{E}(T)(q, h) = \|K_h^{(q)} \odot(T(s) - \hat{T}(s))\|_F \le Z h^\beta(4\sqrt{3n_q+2}),
\end{align}
since the number of indices $n_q$ remains the same for the squared matrix and the rank satisfies $r = \mathrm{rank}(T(s)) = \mathrm{rank}(\mathcal{T}(s)\mathcal{T}(s)^\top) \le \min(\mathrm{rank}(\mathcal{T}(s)), \mathrm{rank}(\mathcal{T}(s)^\top)) = r^\prime$.

Note that for SVD, the left and right singular matrices are unitary, i.e., $UU^\top = VV^\top = I$, and the columns of $U$ are the eigenvectors of the matrix $RR^\top$. $\Q_{j}$ and $\Q^{(q)}_{j}$ are diagonal matrices. Hence, for the local space projections corresponding to anchor point $s_q$ for $K_h^{(q)}T(s)$ and $K_h^{(q)}\hat{T}(s)$, we have:
\begin{align}
\|\S_{j} \Q_{j} {\S_{j}}^\top - \S^{(q)}_{j} \Q^{(q)}_{j}{\S^{(q)}_{j}}^\top\|_F & \le \|\S_{j} \sqrt{\Q_j} \V_j\V_j^\top \sqrt{\Q_j} ^\top {\S_{j}}^\top - \S^{(q)}_{j} \sqrt{\Q^{(q)}_{j}} \V^{(q)}_{j}{\V^{(q)}_{j}}^\top \sqrt{\Q^{(q)}_{j}} ^\top {\S^{(q)}_{j}}^\top\|_F \nonumber \\ & = \|K_h^{(q)} \odot(T(s) - \hat{T}(s))\|_F.
\end{align}
Continuing with Eq.~\eqref{error_main} and Eq.~\eqref{transform_local}, we obtain:
\begin{align}\label{eq:loss_error}
\mathcal{F}(\W^c) - \mathcal{F}(\W^s) & \le \langle \boldsymbol{g}_1(\W) + \boldsymbol{g}_2(\W) , \boldsymbol{g}_2(\W) \rangle Z h^\beta(4\sqrt{3n_q+2}) + H\|\boldsymbol{g}_2(\W)\|^2 Z^2 h^{2\beta}(24n_q+9) \nonumber \\
&= \langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle Z h^\beta(4\sqrt{3n_q+2}) + \|\boldsymbol{g}_2(\W)\|^2 [Z h^\beta(4\sqrt{3n_q+2}) + HZ^2 h^{2\beta}(24n_q+9)].
\end{align}
Following a derivation similar to that of Eq.~\eqref{eq:1} for the term $\langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle$, we have:
\begin{align}
    &\langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle \nonumber \\
    = & \langle \boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)}) + \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) \rangle \nonumber \\
    = & \langle \boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) \rangle + \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) \rangle \nonumber \\
    = & \langle \boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) \rangle + \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) - \boldsymbol{g}_2(\W^{(0)}) \rangle + \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W^{(0)}) \rangle.
\end{align}
Considering:
\begin{align}\label{aeq:1}
    & -2\langle \boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) \rangle + \|\boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)})\|^2 + \|\boldsymbol{g}_2(\W)\|^2 \nonumber \\
    & = \|\boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)}) - \boldsymbol{g}_2(\W)\|^2 \ge 0,
\end{align}
we have:
\begin{align}\label{aineq:1}
    \langle \boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) \rangle \le \frac{1}{2}\|\boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)})\|^2 +\frac{1}{2}\|\boldsymbol{g}_2(\W)\|^2,
\end{align}
and similarly:
\begin{align}\label{aineq:2}
    \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W) - \boldsymbol{g}_2(\W^{(0)}) \rangle \le \frac{1}{2}\|\boldsymbol{g}_2(\W) - \boldsymbol{g}_2(\W^{(0)})\|^2 +\frac{1}{2}\|\boldsymbol{g}_1(\W^{(0)})\|^2.
\end{align}
Combining Eq.~\eqref{aeq:1}, Eq.~\eqref{aineq:1} and Eq.~\eqref{aineq:2} gives an upper bound on $\langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle$, i.e.,
\begin{align}\label{aeq:inner}
    & \langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle \nonumber \\
    \le & \frac{1}{2}\|\boldsymbol{g}_1(\W) - \boldsymbol{g}_1(\W^{(0)})\|^2 +\frac{1}{2}\|\boldsymbol{g}_2(\W)\|^2 \nonumber \\
    & +\frac{1}{2}\|\boldsymbol{g}_2(\W) - \boldsymbol{g}_2(\W^{(0)})\|^2 +\frac{1}{2}\|\boldsymbol{g}_1(\W^{(0)})\|^2 + \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W^{(0)}) \rangle \nonumber \\
    \le & \frac{H^2}{8}\|\W - \W^{(0)}\|^2 + \frac{1}{2}\|\boldsymbol{g}_2(\W)\|^2 \nonumber \\
    & +\frac{H^2}{8}\|\W - \W^{(0)}\|^2 + \frac{1}{2}\|\boldsymbol{g}_1(\W^{(0)})\|^2 + \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W^{(0)}) \rangle \nonumber \\
    \le & \frac{H^2}{4}\|\W - \W^{(0)}\|^2 + \frac{1}{2}\|\boldsymbol{g}_2(\W)\|^2 + \frac{1}{2}\|\boldsymbol{g}_1(\W^{(0)})\|^2 + \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W^{(0)}) \rangle.
\end{align}
Suppose that $\W$ is the directly updated model, without any projection, at the $n$-th iteration, where $n \le K$. Similar to Eq.~\eqref{eq:adiff1} and Eq.~\eqref{eq:adiff2}, we always have:
\begin{align}\label{aeq:adiff1}
    \|\W^{(k)} - \W^{(0)}\|^2 \le \frac{\gamma^2\|\boldsymbol{g}_1(\W^{(0)})\|^2}{H^2}.
\end{align}
Therefore, continuing with Eq.~\eqref{aeq:inner}, we obtain:
\begin{align}\label{eq:inner_error}
     \langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle \le \frac{2+\gamma^2}{4}\|\boldsymbol{g}_1(\W^{(0)})\|^2 + \|\boldsymbol{g}_1(\W^{(0)})\|\|\boldsymbol{g}_2(\W^{(0)})\| + \frac{1}{2}\|\boldsymbol{g}_2(\W)\|^2.
\end{align}
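To spell out this step, the following restates the bounds already used above; the only additional ingredient is the Cauchy--Schwarz inequality applied to the last inner product of Eq.~\eqref{aeq:inner}. Applying Eq.~\eqref{aeq:adiff1} to $\W$ gives
\begin{align}
    \frac{H^2}{4}\|\W - \W^{(0)}\|^2 + \frac{1}{2}\|\boldsymbol{g}_1(\W^{(0)})\|^2 & \le \frac{H^2}{4}\cdot\frac{\gamma^2\|\boldsymbol{g}_1(\W^{(0)})\|^2}{H^2} + \frac{1}{2}\|\boldsymbol{g}_1(\W^{(0)})\|^2 = \frac{2+\gamma^2}{4}\|\boldsymbol{g}_1(\W^{(0)})\|^2, \nonumber \\
    \langle \boldsymbol{g}_1(\W^{(0)}), \boldsymbol{g}_2(\W^{(0)}) \rangle & \le \|\boldsymbol{g}_1(\W^{(0)})\|\|\boldsymbol{g}_2(\W^{(0)})\|. \nonumber
\end{align}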
Combining Eq.~\eqref{eq:inner_error} with Eq.~\eqref{eq:loss_error}, and noting that the function is $B$-Lipschitz so that $\|\boldsymbol{g}_2(\W)\| \le B$, we obtain:
\begin{align}
\mathcal{F}(\W^c) - \mathcal{F}(\W^s) &= \langle \boldsymbol{g}_1(\W), \boldsymbol{g}_2(\W) \rangle Z h^\beta(4\sqrt{3n_q+2}) + \|\boldsymbol{g}_2(\W)\|^2 [Z h^\beta(4\sqrt{3n_q+2}) + HZ^2 h^{2\beta}(24n_q+9)] \nonumber \\
&\le Z h^\beta(4\sqrt{3n_q+2})[\frac{2+\gamma^2}{4}\|\boldsymbol{g}_1(\W^{(0)})\|^2 + \|\boldsymbol{g}_1(\W^{(0)})\|\|\boldsymbol{g}_2(\W^{(0)})\| + \frac{3}{2}B^2] + HZ^2 h^{2\beta}(24n_q+9)B^2.
\end{align}
This gives the upper bound on the error $\mathcal{F}(\W^c) - \mathcal{F}(\W^s)$.
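As a brief check of the coefficient $\frac{3}{2}$, and assuming (as we understand the rest of the proof to do) that the prefactor $Z h^\beta(4\sqrt{3n_q+2})$ is non-negative, the two terms that carry $\|\boldsymbol{g}_2(\W)\|^2$ with this prefactor combine as
\begin{align}
    Z h^\beta(4\sqrt{3n_q+2})\Big[\frac{1}{2}\|\boldsymbol{g}_2(\W)\|^2 + \|\boldsymbol{g}_2(\W)\|^2\Big] \le \frac{3}{2} Z h^\beta(4\sqrt{3n_q+2}) B^2, \nonumber
\end{align}
where the first term comes from Eq.~\eqref{eq:inner_error} and the second from Eq.~\eqref{eq:loss_error}. The remaining term $HZ^2 h^{2\beta}(24n_q+9)\|\boldsymbol{g}_2(\W)\|^2$ is bounded by $HZ^2 h^{2\beta}(24n_q+9)B^2$ directly.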

For the lower bound on $\mathcal{F}(\W^c) - \mathcal{F}(\W^s)$, similar to Eq.~\eqref{error_main}, we first lower-bound $\mathcal{F}(\W^c)$ as:
\begin{align}\label{error_main2}
    \mathcal{F}(\W^c) \ge \mathcal{F}(\W^s) + \nabla \mathcal{F}(\W)^\top(\W^c - \W^s) - \frac{H}{2}\|\W^c - \W^s\|^2,
\end{align}
then, following a proof similar to the one above, we have:
\begin{align}
\mathcal{F}(\W^c) - \mathcal{F}(\W^s) \ge Z h^\beta(4\sqrt{3n_q+2})[-\frac{2+\gamma^2}{4}\|\boldsymbol{g}_1(\W^{(0)})\|^2 + \|\boldsymbol{g}_1(\W^{(0)})\|\|\boldsymbol{g}_2(\W^{(0)})\| + \frac{1}{2}B^2] - HZ^2 h^{2\beta}(24n_q+9)B^2.
\end{align}
This is the lower bound on the error $\mathcal{F}(\W^c) - \mathcal{F}(\W^s)$. Since the argument mirrors the one for the upper bound, we omit the details.
\end{proof}



\section{Popular kernel functions}\label{T3}
We list popular kernel functions in Table~\ref{table:kernel}. The distance $d$ can be computed with a standard measure such as the $\ell_2$ distance or the angular distance induced by cosine similarity. For example, given a global representation matrix $\R_{j}^{l}= [\r_{j, 1}^{l}, \dots, \r_{j, N^j}^{l}] \in \mathbb{R}^{M \times N^j}$ for layer $l$ of task $j$, the distance between indices $a$ and $b$ in $[N^j]$ is $d(a, b) = \arccos(\frac{\langle \r_{j, a}^{l}, \r_{j, b}^{l} \rangle}{\|\r_{j, a}^{l}\|\cdot \|\r_{j, b}^{l}\|})$, where $\r_{j, a}^{l}$ and $\r_{j, b}^{l}$ are the $a$-th and $b$-th columns of the matrix $\R_{j}^{l}$.
\begin{table}[ht]
  \caption{Popular kernel functions and their efficiencies relative to the Epanechnikov kernel.}
  \label{table:kernel}
  \centering
  \begin{tabular}{lll}
    \toprule
    Kernel Type & Kernel Function & Efficiency (\%) \\
    \midrule
    Uniform & $K_h(s_1, s_2) \propto \boldsymbol{1}[d(s_1,s_2)<h]$ & 92.9 \\
    Logistic & $K_h(s_1, s_2) \propto \frac{1}{\exp(\nicefrac{d(s_1, s_2)}{h}) + 2 +  \exp(\nicefrac{-d(s_1, s_2)}{h})}$ & 88.7 \\
    Gaussian & $K_h(s_1, s_2) \propto \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}h^{-2}d(s_1, s_2)^2)$ & 95.1 \\
    Triangular & $K_h(s_1, s_2) \propto (1 - \nicefrac{d(s_1,s_2)}{h})\boldsymbol{1}[d(s_1,s_2)<h]$ & 98.6 \\
    Cosine & $K_h(s_1, s_2) \propto \frac{\pi}{4}\cos{\frac{\pi d(s_1, s_2)}{2h}}\boldsymbol{1}[d(s_1,s_2)<h]$ & 99.9 \\
    Epanechnikov & $K_h(s_1, s_2) \propto \frac{3}{4}[1-(\nicefrac{d(s_1, s_2)}{h})^2]\boldsymbol{1}[d(s_1,s_2)<h]$ & 100 \\
    Silverman & $K_h(s_1, s_2) \propto \frac{1}{2}\exp(-\frac{|\nicefrac{d(s_1, s_2)}{h}|}{\sqrt{2}})\cdot\sin(\frac{|\nicefrac{d(s_1, s_2)}{h}|}{\sqrt{2}} + \frac{\pi}{4})$ & N/A \\
    \bottomrule
  \end{tabular}
\end{table}
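
As a small worked illustration of the distance and kernel definitions above, consider the following hypothetical values, chosen only to make the arithmetic easy and not taken from our experiments: $M=2$, $\r_{j, a}^{l} = (1, 0)^\top$, $\r_{j, b}^{l} = (1, 1)^\top$, and bandwidth $h = 1$. Then
\begin{align}
    d(a, b) = \arccos\Big(\frac{1}{1\cdot\sqrt{2}}\Big) = \frac{\pi}{4} \approx 0.785 < h, \nonumber
\end{align}
so, up to the normalizing constant, the Epanechnikov kernel assigns the weight $\frac{3}{4}[1-(\nicefrac{\pi}{4})^2] \approx 0.287$ and the triangular kernel assigns $1 - \nicefrac{\pi}{4} \approx 0.215$. Any pair with $d(a, b) \ge h$ would receive zero weight under both of these compactly supported kernels.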

\section{Datasets information}\label{T5}
We evaluate the performance of LMSP on four public datasets for CL:
(1) Permuted MNIST (PMNIST)~\citep{lecun2010mnist}: a variant of the MNIST dataset in which the input pixels are randomly permuted. Following \citet{lopez2017gradient,saha2021gradient}, the dataset is divided into 10 tasks by different permutations, and each task contains 10 classes;
(2) CIFAR-100 Split~\citep{krizhevsky2009learning}: the CIFAR-100 dataset is divided into 10 different tasks, and each task is a 10-way multi-class classification problem;
(3) 5-Datasets~\citep{lin2022beyond,lin2022trgp}: following the setting of \citet{lin2022beyond,lin2022trgp}, we use a sequence of 5 datasets, namely CIFAR-10, MNIST, SVHN~\citep{netzer2011reading}, not-MNIST~\citep{bulatov2011notmnist}, and Fashion MNIST~\citep{xiao2017fashion}, where the classification problem on each dataset is an individual task;
and (4) MiniImageNet~\citep{vinyals2016matching}: the MiniImageNet dataset is divided into 20 tasks, and each task includes 5 classes.

\section{Ablation studies on kernel type}\label{T4}
Figure~\ref{ablation_study:kernel_type} shows the influence of different kernels. We evaluated five different kernels in our model, and the results show that the Gaussian kernel achieves the best performance. However, the effect of the kernel choice is small and the overall performance is similar across kernels, so in practice one could choose the simplest kernel to reduce computation.

\begin{figure}[ht]
\centering
\includegraphics[width=0.72\linewidth]{Fig/new_appendix.pdf}
\caption{Ablation studies on kernel type.
\label{ablation_study:kernel_type}}
\end{figure}

\section{Results of forward knowledge transfer}\label{T6}
We show the results of forward knowledge transfer (FWT) in Table~\ref{table:fwt}. We compare the FWT performance of our LMSP approach with that of GPM, TRGP, and CUBER, which are the methods most closely related to ours, on four public datasets. The value for GPM is zero because we treat GPM as the baseline and report the relative FWT improvement over GPM. The table shows that the FWT performance of LMSP exceeds that of TRGP and CUBER (the two most related state-of-the-art methods) on the PMNIST, CIFAR-100 Split, and 5-Datasets benchmarks, and is comparable to that of TRGP and CUBER on the MiniImageNet dataset. This indicates that the good BWT performance of our LMSP method is not achieved at the cost of sacrificing FWT performance.

\begin{table}[ht]
  %\captionsetup{labelfont={color=blue},font={color=blue}}
  \caption{Comparison of FWT among GPM, TRGP, CUBER and LMSP. The value for GPM is zero because we treat GPM as the baseline and consider the relative FWT improvement over GPM.}
  \label{table:fwt}
  \centering
  %\color{blue}
  {\begin{tabular}{lllll}
    \toprule
    FWT (\%) & PMNIST & CIFAR-100 Split & 5-Datasets & MiniImageNet \\
    \midrule
    GPM & 0 & 0 & 0 & 0 \\
    TRGP & 0.18 & 2.01 & 1.98 & 2.36 \\
    CUBER & 0.80  & 2.79 & 1.96 & \textbf{3.13} \\
    \textbf{LMSP($r=25$)} & \textbf{0.92} & \textbf{2.89} & \textbf{2.43} & 2.79 \\
    \bottomrule
  \end{tabular}}
\end{table}

\end{document}
