

\subsection{Experimental Evaluations of Channel Pruning and Weight Allocation Techniques}\label{pruning sec}



\noindent\textbf{Empirical benefits of our proposed FLOP-aware pruning.} Both the SplitCIFAR100 (Table~\ref{CIFARtable}) and MNIST (Figure~\ref{fig:mnist}) experiments show that our pruning algorithm performs well in the CL setting (compare Individual-0.2 with Individual-1), but they do not compare against alternative channel-pruning techniques. In this section, we evaluate our FLOP-aware pruning algorithm on its own, without continual learning, and compare it to the $\ell_1$-based~\cite{liu2017learning} and Polarization-based~\cite{zhuang2020neuron} channel pruning methods. For all experiments, we use the same ResNet18 architecture and hyperparameters as in Table~\ref{CIFARtable} and train on the CIFAR10 dataset. Results are presented in Figures~\ref{fig:CIFAR10_acc} and~\ref{fig:CIFAR10_ratio}, and all are averaged over 5 trials. We run the three methods under different FLOPs constraints and also track the fraction of nonzero channels after pruning. Figure~\ref{fig:CIFAR10_acc} shows test accuracy after pruning. Our FLOP-aware pruning method (Blue curve) maintains high accuracy relative to the other methods. The benefit of our approach is most visible when the FLOPs constraint is more aggressive (e.g., around 1\% FLOPs). The Green curve shows the Polarization-based method, which fails to complete the pruning task when the FLOPs constraint is below 50\%. We found that this is because, below a certain threshold, Polarization tends to prune a whole layer, which disconnects the network. Figure~\ref{fig:CIFAR10_ratio} shows the fraction of nonzero channels that remain after pruning: our method keeps more channels while pruning the same number of FLOPs. 
Here, we emphasize that keeping more channels is a measure of connectivity within the network (as we wish to avoid very sparse layers). We also observe that the $\ell_1$-based method works well even for small FLOPs; however, it uses the same $\lambda_l$ penalization for all layers, so it is not FLOP-aware and performs uniformly worse than our FLOP-aware algorithm.
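
\noindent To make the notion of FLOP-awareness concrete, the following back-of-the-envelope count may help. It is only an illustrative sketch: the symbols $k_l$, $c_l$, $H_l$, $W_l$ are our own notation (kernel size, number of output channels and output resolution of convolutional layer $l$) and are not part of the algorithm description. Counting multiply-accumulate operations and ignoring biases, normalization and skip connections, layer $l$ costs
\[
\text{FLOPs}_l = k_l^2\, c_{l-1}\, c_l\, H_l W_l ,
\]
so removing one output channel of layer $l$ saves roughly
\[
k_l^2\, c_{l-1}\, H_l W_l \;+\; k_{l+1}^2\, c_{l+1}\, H_{l+1} W_{l+1}
\]
operations: the first term comes from layer $l$ itself and the second from the next layer, which loses one input channel. Since this quantity varies across layers, a penalty that is uniform across layers cannot account for how many FLOPs each pruned channel actually saves.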

    
    


\begin{figure*}[t]
\vspace{-7pt}
\centering
\begin{subfigure}[t]{.33\textwidth}
  \centering
  \begin{tikzpicture}\hspace{-0pt}
        \node at (0,0) [scale=1.] {\includegraphics[width=\linewidth]{figs/CIFAR10.pdf}};
        \node at (0,-1.9) [scale=0.8] {{FLOPs constraint $\gamma$}};
        \node at (-2.5,0) [scale=0.8,rotate=90] {Accuracy};
    \end{tikzpicture}
        \centering
  \caption{Test accuracy}
	\hspace{-0pt}\label{fig:CIFAR10_acc}
\end{subfigure}\hspace{0pt}\begin{subfigure}[t]{.33\textwidth}
  \centering
  \begin{tikzpicture}\hspace{-0pt}
        \node at (-0,0) [scale=1.] {\includegraphics[width=\linewidth]{figs/CIFAR10_channel_num.pdf}};
        \node at (0,-1.9) [scale=0.8] {FLOPs constraint $\gamma$};
        \node at (-2.75,0) [scale=0.9,rotate=90] {Fraction of Nonzero Channels};
    \end{tikzpicture}
    \caption{{Fraction of nonzero channels
    }}\label{fig:CIFAR10_ratio}
	
\end{subfigure}\hspace{0pt}\begin{subfigure}[t]{.33\textwidth}
  \centering
  \begin{tikzpicture}\hspace{-0pt}
        \node at (0,0) [scale=1.] {\includegraphics[width=\linewidth]{figs/FLOP.pdf}};
        \node at (0,-1.9) [scale=0.8] {FLOPs constraint $\gamma$};
        \node at (-2.5,0) [scale=0.9,rotate=90] {Accuracy};
    \end{tikzpicture}
    \centering
    \caption{{Impact of weight allocation $\alpha$
    }}\label{fig:FLOP}
	
\end{subfigure}\vspace{-10pt}
\caption{{
\red{Experimental evaluations of our FLOP-aware channel pruning and weight allocation techniques.} Figs.~\ref{fig:CIFAR10_acc} and~\ref{fig:CIFAR10_ratio} show channel pruning results for different methods. We compare our FLOP-aware pruning algorithm to the $\ell_1$-based~\cite{liu2017learning} and Polarization-based~\cite{zhuang2020neuron} methods on the ResNet18 model used in our CL experiments. These experiments are conducted on standard classification with the CIFAR10 dataset and focus only on pruning performance (rather than the CL setting). Our FLOP-aware pruning is displayed as the Blue curve, and it broadly outperforms the alternative approaches ($\ell_1$-based and Polarization-based). The performance gap is most notable in the very-few-FLOPs regime (which is of more interest for inference-time efficiency).
In Fig.~\ref{fig:FLOP}, we display the impact of weight allocation $\alpha$ (Eq.~\eqref{wa-eq}) and FLOPs constraint $\gamma$ on \emph{task-averaged ESPN accuracy} for SplitCIFAR100. Setting $\alpha$ too small (Blue curve) degrades performance as tasks do not get enough nonzeros to train on. Setting both $\alpha$ and $\gamma$ to be large also degrades performance (Red \& Green curves), because the first few tasks get to occupy the whole supernetwork at the expense of future tasks. Finally, $\alpha=0.1$ (Orange) achieves competitive performance for all $\gamma$ choices.
}}\vspace{-7pt}
\end{figure*}
\section{Application to Shallow Networks}\label{app:application}
As a concrete instantiation of Theorem~\ref{cl thm}, let us consider a realizable regression setting with a shallow network. More sophisticated examples are deferred to future work. Fix positive integers $d$ and $r_\text{frz}\leq r$. Here, $d$ is the raw feature dimension and $r$ is the representation dimension, which is often much smaller than $d$. The ingredients of our neural net example are as follows.
\begin{myenumerate}
\item Let $\psi:\mathbb{R}\rightarrow\mathbb{R}$ be a Lipschitz activation function with $\psi(0)=0$ such as Identity or (parametric) ReLU.
\item Let $\sigma:\mathbb{R}\rightarrow[-1,1]$ be a Lipschitz link function such as the logistic function $1/(1+e^{-x})$.
\item Let $Z$ be a zero-mean noise variable taking values on $[-1,1]$.
\item Fix vectors $(\vct{v}^\star_t)_{t=1}^T\in\mathbb{R}^r$ with $\ell_2$ norms bounded by some $B>0$.
\item Fix matrix $\mtx{W}^\star\in\mathbb{R}^{r\times d}$ with spectral norm (maximum singular value) upper bounded by some $\bar{B}>0$.
\item Given input $\vct{x}\in\mathbb{R}^d$, task $t$ samples an independent $Z$ and assigns the label
\begin{align}
y=\sigma({\vct{v}^\star_t}^\top \psi(\mtx{W}^\star\vct{x}))+Z.\label{planted}
\end{align}
\item Fix $r_\text{frz}\leq r$. Let $\mtx{W}_\text{frz}\in\mathbb{R}^{r\times d}$ be the matrix whose first $r_\text{frz}$ rows are the same as those of $\mtx{W}^\star$ (this assumes no mismatch), whereas the last $r_\text{new}:=r-r_\text{frz}$ rows are equal to zero.
\end{myenumerate}
This setting assumes that the first $r_\text{frz}$ features are generated by $\mtx{W}_\text{frz}$, so \eqref{crl} should learn the remaining features $\mtx{W}^\star_\text{new}:=\mtx{W}^\star-\mtx{W}_\text{frz}$ and the classifier heads $(\vct{v}^\star_t)_{t=1}^T$. We remark that one can use arbitrary $[-C,C]$ limits rather than $[-1,1]$ above, or replace the $[-1,1]$ limits with subgaussian tail conditions.

For some $\bar{B}\geq \|\mtx{W}^\star\|_F$, we choose the search space ${\cal{W}}$ to be the set of all matrices $\mtx{W}_\text{new}$ whose spectral norm obeys $\|\mtx{W}_\text{new}\|\leq \bar{B}$ and whose first $r_\text{frz}$ rows are zero. This way, we focus on learning the missing part of the representation $\mtx{W}^\star$. Let us fix the loss function $\ell$ to be quadratic and denote ${\mtx{V}}=(\vct{v}_t)_{t=1}^T$. Then, \eqref{crl} takes the following parametric form
\begin{align}
&\hat{{\mtx{V}}},\hat{\mtx{W}}_\text{new}=\underset{\mtx{W}=\mtx{W}_\text{frz}+\mtx{W}_\text{new}}{\underset{\tn{\vct{v}_t}\leq B,\mtx{W}_\text{new}\in {\cal{W}}}{\arg\min}}{\widehat{\cal{L}}}({\mtx{V}},\mtx{W}):=\frac{1}{T}\sum_{t=1}^T {\widehat{\cal{L}}}_{\mathcal{S}_t}(\vct{v}_t,\mtx{W}) \nonumber\\
&\text{where}\quad {\widehat{\cal{L}}}_{\mathcal{S}_t}(\vct{v}_t,\mtx{W})=\frac{1}{N}\sum_{i=1}^N (y_{ti}-\sigma(\vct{v}_t^\top \psi(\mtx{W}\vct{x}_{ti})))^2. \nonumber
\end{align}
We have the following result regarding this optimization. It is essentially a corollary of Theorem~\ref{cl thm}. The proof is deferred to Appendix~\ref{app B}.
\begin{theorem}\label{cl thm3} Consider the problem above with $T$ tasks containing $N$ samples each, with datasets $(\mathcal{S}_t)_{t=1}^T$ generated according to \eqref{planted}. Suppose the input domain $\mathcal{X}\subset\mathbb{R}^d$ has bounded $\ell_2$ norm. With probability at least $1-2e^{-\tau}$, the task-averaged population risk of the solution $(\hat{{\mtx{V}}},\hat{\mtx{W}}_\text{new})$ obeys 
\begin{align}
{\cal{L}}(\hat{{\mtx{V}}},\hat{\mtx{W}}_\text{new})\leq \operatorname{\mathbb{E}}[Z^2]+ \sqrt{\frac{\ordet{T r+r_\text{new} d+\tau}}{T N}}.\nonumber
\end{align}
\end{theorem}
\textbf{Interpretation:} Observe that the minimal risk is ${\cal{L}}^\star=\operatorname{\mathbb{E}}[Z^2]$, which is the noise independent of the features. The additional components are the excess risk due to finite samples. In light of Theorem~\ref{cl thm}, we simply plug in $\cc{{\mtx{H}}}=r$, $\cc{{\boldsymbol{\Phi}}_{\text{new}}}=r_\text{new} d$ and ${\cal{L}}^\star_{\phi_{\text{frz}}}={\cal{L}}^\star$. The first two arise from counting the number of trainable parameters: each classifier has $r$ parameters and the representation ${\boldsymbol{\Phi}}_{\text{new}}$ has $r_\text{new} d$ parameters. The identity ${\cal{L}}^\star_{\phi_{\text{frz}}}={\cal{L}}^\star$ arises from the fact that we chose $\mtx{W}_\text{frz}$ to be a subset of the rows of $\mtx{W}^\star$, thus there is no mismatch. When $\mtx{W}_\text{frz}=0$ (i.e.~learning the representation from scratch), this bound is comparable to prior works \cite{tripuraneni2020provable,du2020few}, and in fact, it leads to (slightly) improved sample-complexity bounds. For instance, when $\psi$ is the identity activation (i.e.~linear setting), \cite{tripuraneni2020provable} requires $TN\gtrsim r^2d$ samples, whereas our sample size grows only linearly in $r$ and requires $TN\gtrsim rd$. Additionally, \cite{du2020few} requires per-task sample sizes to obey $N\gtrsim d$, whereas we only require $N\gtrsim r$.
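
\noindent As a rough numerical illustration of the comparison above, consider hypothetical values $d=1000$ and $r=10$ (chosen only for illustration), and ignore constants and logarithmic factors. Then
\[
r^2d = 10^{5} \quad\text{versus}\quad rd = 10^{4} \qquad\text{for the total sample size } TN,
\]
and
\[
N\gtrsim d = 1000 \quad\text{versus}\quad N\gtrsim r = 10 \qquad\text{for the per-task sample size},
\]
so in this regime the requirements differ by factors of $r=10$ and $d/r=100$, respectively.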




\section{Appendix B: Proof of Theorem \ref{cl thm}}


In this section, we prove our main theoretical result Theorem~\ref{cl thm}.

\begin{proof} To start, let ${\mtx{H}}_\varepsilon$ and ${\boldsymbol{\Phi}}_{\text{new},\varepsilon}$ be $\varepsilon$-covers of ${\mtx{H}}$ and ${\boldsymbol{\Phi}}_{\text{new}}$, and define $\vct{\mathcal{F}}={\mtx{H}}_\varepsilon^T\times{\boldsymbol{\Phi}}_{\text{new},\varepsilon}$. Following Def.~\ref{def:cov}, we have 
\begin{align}
    \log|\vct{\mathcal{F}}|\leq D ~~~\text{where}~~~\red{D:=(T\mathcal{C}({\mtx{H}})+\mathcal{C}({\boldsymbol{\Phi}}_{\text{new}}))\log(\bar{C}/\varepsilon)}. \nonumber
\end{align}
Set $t=\sqrt{\frac{D+\tau}{cNT}}$ where $c>0$ is an absolute constant. Since the loss function $\ell(\cdot)$ is bounded in $[0,1]$, the Hoeffding inequality together with a union bound over $\vct{\mathcal{F}}$ gives
\begin{align}
    \mathbb{P}(|{\cal{L}}({\vct{\hat{h}}},\hat{\phi})- {\cal{L}}^{\st}_{{\phi}_{\text{frz}}}|\geq t)&\leq2|\vct{\mathcal{F}}|e^{-cNTt^2} \nonumber\\
    &\leq2|\vct{\mathcal{F}}|e^{-D-\tau} \nonumber\\
    &\leq2e^{-\tau}. \label{prob}
\end{align}
Now consider the task-averaged risk perturbation introduced by passing to the cover. Let $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}$, $\phi_{\text{new}}'\in{\boldsymbol{\Phi}}_{\text{new},\varepsilon}$ and $h_t\in{\mtx{H}}$, $h_t'\in{\mtx{H}}_{\varepsilon}$ be $L$-Lipschitz functions (in Euclidean distance), where $\phi_{\text{new}}'$ and $h_t'$ are $\varepsilon$-neighbors of $\phi_{\text{new}}$ and $h_t$ in the respective covers. Then $\phi:=\phi_{\text{new}}+\phi_{\text{frz}}$ and $\phi':=\phi_{\text{new}}'+\phi_{\text{frz}}$ also satisfy the $L$-Lipschitz constraint, and we have
\begin{align}
    {\cal{L}}({\vct{h}},\phi)- {\cal{L}}({\vct{h}}',\phi')&\leq|{\cal{L}}({\vct{h}},\phi)- {\cal{L}}({\vct{h}}',\phi')| \nonumber\\
    &\leq \sup_{t\in[T]}|{\cal{L}}_{\mathcal{S}_t}(h_t,\phi)-{\cal{L}}_{\mathcal{S}_t}(h_t',\phi')| \nonumber\\
    &\leq \sup_{t\in[T], i\in[N]}|\ell(y_{ti},h_t\circ\phi(\vct{x}_{ti}))-\ell(y_{ti},h_t'\circ\phi'(\vct{x}_{ti}))| \nonumber \\
    &\leq \Gamma\sup_{t\in[T], i\in[N]}|h_t\circ\phi(\vct{x}_{ti})-h_t'\circ\phi'(\vct{x}_{ti})| \nonumber\\
    &\leq\Gamma\sup_{t\in[T], i\in[N]}\big(|h_t\circ\phi(\vct{x}_{ti})-h_t\circ\phi'(\vct{x}_{ti})|+|h_t\circ\phi'(\vct{x}_{ti})-h_t'\circ\phi'(\vct{x}_{ti})|\big) \nonumber\\
    &\leq \Gamma (L+1)\varepsilon. \label{pert}
\end{align}
Combining \eqref{prob} and \eqref{pert}, we find that, for the chosen $\varepsilon$, with probability at least $1-2e^{-\tau}$
\begin{align}
    {\cal{L}}({\vct{\hat{h}}},\hat{\phi})- {\cal{L}}^{\st}_{{\phi}_{\text{frz}}}\leq\Gamma(L+1)\varepsilon+\sqrt{\frac{D+\tau}{cNT}}.\nonumber
\end{align}
\end{proof}
\section{Experimental Setting of Figure~\ref{fig:diversity}}\label{app:imagenet}
Following \cite{mallya2018packnet,hung2019compacting}, we conduct the experiments in Sec.~\ref{sec:crl_exp} to study how task order and diversity benefit CL. We use $6$ image classification tasks, where ImageNet-1k~\cite{krizhevsky2012imagenet} is the first task, followed by CUBS~\cite{wah2011caltech}, Stanford Cars~\cite{krause20133d}, Flowers~\cite{nilsback2008automated}, WikiArt~\cite{saleh2015large} and Sketch~\cite{eitz2012humans}. Table~\ref{table:crl_dataset} provides detailed information on all datasets. In the experiment, we train a standard ResNet50 model on the last 5 tasks (CUBS to Sketch) to explore how the representation learned from ImageNet can benefit CRL. We follow the same learning hyperparameter setting as Table~\ref{CIFARtable}, where we use a batch size of 128 and the Adam optimizer with $(\beta_1,\beta_2)=(0,0.999)$. We train for $60$ and $90$ epochs for pre-training and pruning, respectively, with learning rate $0.01$, and then fine-tune the free weights for $100$ epochs using cosine decay of the learning rate starting from $0.01$. 
Specifically, we employ weight allocation $\alpha=0.1$ (Sec.~\ref{appsec: espnalgo}) in both Individual and ESPN to enable continual representation learning. In contrast to Individual/ESPN without pretraining, ESPN with ImageNet pretraining employs a sparse pretrained ImageNet model from \cite{Wortsman2019DiscoveringNW} with 20\% nonzero weights. Table~\ref{table:crl_exp} shows the test accuracy when learning the last 5 tasks; performance improves on all 5 tasks.
}




\section{Proof of Theorems \ref{cl thm} and \ref{cl thm3}}\label{app B}


In this section and in Appendix~\ref{app C}, we provide two main theorems: Theorem~\ref{cl thm} and Theorem~\ref{seq thm}. We first prove Theorem~\ref{cl thm}, which adds $T$ new tasks by leveraging a frozen feature extractor $\phi_{\text{frz}}$. Observe that Theorem~\ref{cl thm} adds the $T$ new tasks in a multitask fashion, which models the setting where new tasks arrive in batches of size $T$. This still captures continual learning because the frozen feature extractor corresponds to the features built by earlier tasks. Also, setting $T=1$ in Theorem~\ref{cl thm} corresponds to adding a single new task and updating the representation. Using this observation, we provide an additional result, Theorem~\ref{seq thm}, where the tasks are added to the super-network sequentially. Theorem~\ref{seq thm}, provided in Appendix~\ref{app C}, arguably better captures the CL setting. In essence, it follows from iterative applications of Theorem~\ref{cl thm} after introducing proper definitions that capture the impact of imperfections due to finite-sample learning (see Definition~\ref{def pop seq} and Assumption~\ref{fin comp}).



\subsection{Proof of Theorem \ref{cl thm}}
The original statement of Theorem~\ref{cl thm} does not introduce certain terms formally due to space limitations. Here, we first add some clarifying remarks. Let $\mathcal{B}(S)$ return the smallest Euclidean ball containing the set $S$ (that lies in a Euclidean space). For a function $f:\mathcal{Z}\rightarrow\mathcal{Z}'$, define the $\lin{\cdot}$ norm to be the Lipschitz constant $\lin{f}:=\lin{f}^{\mathcal{Z}}=\sup_{\vct{x},\vct{y}\in \mathcal{B}(\mathcal{Z})}\frac{\tn{f(\vct{x})-f(\vct{y})}}{\tn{\vct{x}-\vct{y}}}$. Below, we assume that $\phi_{\text{frz}}$ and all $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}$, $h\in{\mtx{H}}$ are $L$-Lipschitz (on their sets of input features). Suppose $\mathcal{X}$ is the set of feasible input features and $\mathcal{X}$ has bounded Euclidean radius $R=\sup_{\vct{x}\in\mathcal{X}}\tn{\vct{x}}$. This means that the input features of the classifier $h$ lie in the set $\mathcal{X}'=\phi_{\text{frz}}\circ\mathcal{X}+{\boldsymbol{\Phi}}_{\text{new}}\circ\mathcal{X}$ with Euclidean radius bounded by $2LR$. Thus both raw features and intermediate features \ylm{(output of $\phi\in{\boldsymbol{\Phi}}$)} are bounded, and we use these sets in Def.~\ref{def:cov}. Accordingly, we set $\bar{C}:=\max(\bar{C}_{\mathcal{X}},\bar{C}_{\mathcal{X}'})$ below.

\begin{theorem}[Theorem \ref{cl thm} restated] \label{cl thm2} Recall that $({\vct{\hat{h}}},\hat{\phi}=\hat{\phi}_{\text{new}}+{\phi}_{\text{frz}})$ is the solution of \eqref{crl}. Suppose that the loss function $\ell(y,\hat{y})$ takes values on $[0,B]$ and is $\Gamma$-Lipschitz w.r.t.~$\hat{y}$. Draw $T$ independent datasets $\{(\vct{x}_{ti},y_{ti})\}_{i=1}^N\subset \mathcal{X}\times \mathcal{Y}$ for $t\in [T]$ where each dataset is distributed i.i.d.~according to ${\cal{D}}_t$. Suppose the input set $\mathcal{X}$ has bounded Euclidean radius. Suppose $\lin{\phi_{\text{frz}}},\lin{\phi_{\text{new}}},\lin{h}\leq L$ for all $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}},h\in{\mtx{H}}$. With probability at least $1-2e^{-\tau}$, for some absolute constant $C>0$, the task-averaged population risk of the solution $({\vct{\hat{h}}},\hat{\phi})$ obeys 
\begin{align}
{\cal{L}}({\vct{\hat{h}}},\hat{\phi})&\leq {\cal{L}}^{\st}_{{\phi}_{\text{frz}}}+ CB\sqrt{\frac{(T\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}})\log(\bar{C}\Gamma (L+1)NT)+\tau}{TN}}\nonumber\\
&\leq {\cal{L}}^{\st}+\text{MM}_{\text{frz}}+CB\sqrt{\frac{(T\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}})\log(\bar{C}\Gamma (L+1)NT)+\tau}{TN}}.\nonumber
\end{align}
\end{theorem}
\begin{proof} Below $c,C>0$ denote absolute constants. For a scalar $a$, define $a_+=a+1$. The proof uses a covering argument following the definition of the covering dimension. Fix $1>\varepsilon>0$. To start, let ${\boldsymbol{\Phi}}_{\text{new},\eps}$ and ${\mtx{H}}_\varepsilon$ be $\varepsilon$-covers (per Definition~\ref{def:cov}) of the sets ${\boldsymbol{\Phi}}_{\text{new}}$ and ${\mtx{H}}$, respectively. Let ${\mtx{H}}_\varepsilon^T$ be the $T$-fold Cartesian product of ${\mtx{H}}_\varepsilon$. Our goal is to bound the supremum of the gap between the empirical and population risks, from which the result follows. 

\noindent\textbf{$\bullet$ Step 1: Union bound over the cover.} Following Definition \ref{def:cov}, we have that $\log|{\mtx{H}}^T_\varepsilon|\leq T\cc{{\mtx{H}}}\log(\bar{C}/\varepsilon)$, $\log|{\boldsymbol{\Phi}}_{\text{new},\eps}| \leq \cc{{\boldsymbol{\Phi}}_{\text{new}}}\log(\bar{C}/\varepsilon)$. Define $\mathcal{S}={\boldsymbol{\Phi}}_{\text{new},\eps}\times {\mtx{H}}^T_{\varepsilon}$. These imply that
\[
\log|\mathcal{S}|\leq D\quad\text{where}\quad  D:=(\cc{{\boldsymbol{\Phi}}_{\text{new}}}+T\cc{{\mtx{H}}})\log\left(\frac{\bar{C}}{\varepsilon}\right).
\]
Set $t=B\sqrt{\frac{D+\tau}{cNT}}$ for an absolute constant $c>0$ which corresponds to the concentration rate of the Hoeffding inequality that follows next. Using the fact that the loss function is bounded in $[0,B]$ and applying a Hoeffding bound over all elements of the cover, we find that for all $(\vct{{\vct{h}}},\phi)\in \mathcal{S}$
\begin{align}
\mathbb{P}(|{\cal{L}}(\vct{{\vct{h}}},\phi)-{\widehat{\cal{L}}}(\vct{{\vct{h}}},\phi)|\geq t)&\leq 2|\mathcal{S}|e^{-\frac{cNTt^2}{B^2}}\nonumber\\
&\leq 2|\mathcal{S}|e^{-D-\tau}=2e^{\log|\mathcal{S}|-D-\tau}\nonumber\\
&\leq 2e^{-\tau}.\label{prob bound}
\end{align}
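
\noindent For completeness, here is the substitution behind the second inequality. With $t=B\sqrt{\frac{D+\tau}{cNT}}$ we have
\[
\frac{cNTt^2}{B^2}=\frac{cNT}{B^2}\cdot\frac{B^2(D+\tau)}{cNT}=D+\tau ,
\]
so each element of the cover fails with probability at most $2e^{-D-\tau}$, and a union bound over the at most $e^{D}$ elements of $\mathcal{S}$ leaves a total failure probability of at most $2e^{-\tau}$.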

\noindent\textbf{$\bullet$ Step 2: Perturbation analysis.} We showed the concentration on the cover. Now we need to relate the cover to the continuous set. Given any candidate $\phi=\phi_{\text{new}}+\phi_{\text{frz}}$ with $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}},\vct{{\vct{h}}}:=(h_t)_{t=1}^T\in{\mtx{H}}^T$, we draw a neighbor $\phi'=\phi_{\text{new}}'+\phi_{\text{frz}}$ with $\phi_{\text{new}}'\in{\boldsymbol{\Phi}}_{\text{new},\eps},\vct{{\vct{h}}}':=(h'_t)_{t=1}^T\in{\mtx{H}}_\varepsilon^T$.


Recall that $\lin{\phi_{\text{new}}},\lin{h}\leq L$ for all $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}},h\in{\mtx{H}}$. The task-averaged risk perturbation relates to the individual risks, which in turn relate to individual examples, as follows. Let $f_t=h_t\circ \phi$ and $f'_t=h'_t\circ \phi'$ for all $t\in[T]$.
\begin{align} 
|{\cal{L}}(\vct{{\vct{h}}},\phi)-{\cal{L}}(\vct{{\vct{h}}}',\phi')|&\leq \sup_{1\leq t\leq T} |{\cal{L}}_t(h_t,\phi)-{\cal{L}}_t(h_t',\phi')|\\
&\leq\sup_{i\in[N],t\in[T]}|\ell(y_{ti},f_t(\vct{x}_{ti}))-\ell(y_{ti},f'_t(\vct{x}_{ti}))|\\
&\leq\Gamma\sup_{i\in[N],t\in[T]}|f_t(\vct{x}_{ti})-f'_t(\vct{x}_{ti})|.\label{pert1 bound}
\end{align}
The perturbations for the individual examples are bounded via triangle inequalities as follows.
\begin{align}
|f_t(\vct{x}_{ti})-f'_t(\vct{x}_{ti})|=  &|h_t\circ\phi(\vct{x}_{ti})-h'_t\circ\phi'(\vct{x}_{ti})|\nonumber\\
\leq & (|h_t\circ\phi(\vct{x}_{ti})-h_t\circ\phi'(\vct{x}_{ti})|+|h_t\circ\phi'(\vct{x}_{ti})-h'_t\circ\phi'(\vct{x}_{ti})|)\nonumber\\
\leq& (L+1)\varepsilon:=L_+\varepsilon.\label{last line}
\end{align}
The last line follows from the triangle inequality via $\varepsilon$-covering and $L$-Lipschitzness of $h$ and ${\phi}$ as follows
\begin{itemize}
\item Since $\tn{\phi(\vct{x}_{ti})-\phi'(\vct{x}_{ti})}=\tn{\phi_{\text{new}}(\vct{x}_{ti})-\phi_{\text{new}}'(\vct{x}_{ti})}\leq \varepsilon$, the $L$-Lipschitzness of $h_t$ gives $|h_t\circ\phi(\vct{x}_{ti})-h_t\circ\phi'(\vct{x}_{ti})|\leq L\varepsilon$.
\item Set $\vct{v}=\phi'(\vct{x}_{ti})$. Since $|h_t(\vct{v})-h'_t(\vct{v})|\leq \varepsilon$ by the covering property, we have $|h'_t\circ\phi'(\vct{x}_{ti})-h_t\circ\phi'(\vct{x}_{ti})|\leq \varepsilon$.
\end{itemize}
Combining these, following \eqref{pert1 bound}, and repeating the identical perturbation argument \eqref{last line} for the empirical risk ${\widehat{\cal{L}}}$, we obtain
\begin{align}
\max(|{\cal{L}}(\vct{{\vct{h}}},\phi)-{\cal{L}}(\vct{{\vct{h}}}',\phi')|, |{\widehat{\cal{L}}}(\vct{{\vct{h}}},\phi)-{\widehat{\cal{L}}}(\vct{{\vct{h}}}',\phi')|)\leq \Gamma L_+\varepsilon.\label{pert bound}
\end{align}

\noindent\textbf{$\bullet$ Step 3: Putting things together.} Combining \eqref{prob bound} and \eqref{pert bound}, we find that, with probability at least $1-2e^{-\tau}$, for all ${\phi}=\phi_{\text{new}}+\phi_{\text{frz}}$ with $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}$ and all $\vct{{\vct{h}}}\in{\mtx{H}}^T$, we have that
\begin{align}
|{\cal{L}}(\vct{{\vct{h}}},\phi)-{\widehat{\cal{L}}}(\vct{{\vct{h}}},\phi)|\leq 2\Gamma L_+\varepsilon+B\sqrt{\frac{D+\tau}{cNT}}.
\end{align}
Setting $\varepsilon=\frac{1}{\Gamma L_+NT}$, for an updated constant $C>0$, we find
\begin{align}
|{\cal{L}}(\vct{{\vct{h}}},\phi)-{\widehat{\cal{L}}}(\vct{{\vct{h}}},\phi)|\leq CB\sqrt{\frac{D+\tau}{NT}}.\label{unif conv}
\end{align}
where $D=(\cc{{\boldsymbol{\Phi}}_{\text{new}}}+T\cc{{\mtx{H}}})\log(\bar{C}\Gamma L_+NT)$ following the above definition of $D$.
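
\noindent To spell out this substitution: with $\varepsilon=\frac{1}{\Gamma L_+NT}$, the perturbation term becomes $2\Gamma L_+\varepsilon=\frac{2}{NT}$ and the logarithmic factor becomes $\log(\bar{C}/\varepsilon)=\log(\bar{C}\Gamma L_+NT)$. If, in addition, $B^2(D+\tau)NT\geq 1$ (an assumption we add here only to make the absorption explicit), then
\[
\frac{2}{NT}\leq 2B\sqrt{\frac{D+\tau}{NT}},
\]
so both terms can be merged into $CB\sqrt{\frac{D+\tau}{NT}}$ with, for instance, $C=2+c^{-1/2}$.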

Note that the uniform concentration above also implies the identical bound for the minimizer of the empirical risk. Let $({\vct{\hat{h}}},\hat{\phi})$ be the minimizer of the empirical risk, and let $(\vct{{\vct{h}}}^{\star,\phi_{\text{frz}}},\phi_{\text{new}}^{\star,\phi_{\text{frz}}})$ be the optimal hypothesis in $({\mtx{H}},{\boldsymbol{\Phi}}_{\text{new}})$ minimizing the population risk subject to using the frozen feature extractor $\phi_{\text{frz}}$, that is, ${\cal{L}}(\vct{{\vct{h}}}^{\star,\phi_{\text{frz}}},\phi_{\text{new}}^{\star,\phi_{\text{frz}}}+\phi_{\text{frz}})={\cal{L}}^\star_{\phi_{\text{frz}}}$. Then, applying \eqref{unif conv} twice and using the optimality of $({\vct{\hat{h}}},\hat{\phi})$ for the empirical risk, we note that
\[
{\cal{L}}({\vct{\hat{h}}},\hat{\phi})\leq {\widehat{\cal{L}}}({\vct{\hat{h}}},\hat{\phi})+CB\sqrt{\frac{D+\tau}{NT}}\leq  {\widehat{\cal{L}}}(\vct{{\vct{h}}}^{\star,\phi_{\text{frz}}},\phi_{\text{new}}^{\star,\phi_{\text{frz}}}+\phi_{\text{frz}})+CB\sqrt{\frac{D+\tau}{NT}}\leq {\cal{L}}^\star_{\phi_{\text{frz}}}+2CB\sqrt{\frac{D+\tau}{NT}}.
\]
This concludes the proof of the main statement (first line). The second statement follows directly from the definition of mismatch, that is, using the fact that $\text{MM}_{\text{frz}}={\cal{L}}^\star_{\phi_{\text{frz}}}-{\cal{L}}^\star$.
\end{proof}


\subsection{Proof of Theorem \ref{cl thm3}}

\begin{proof}
Within this setting, classifier heads correspond to $h(\vct{a}):=h_{\vct{v}}(\vct{a})=\sigma(\vct{v}^\top\psi(\vct{a}))$ and ${\mtx{H}}=\{h_{\vct{v}}{~\big |~} \tn{\vct{v}}\leq B\}$. Similarly, feature representations  correspond to $\phi(\vct{x})=\mtx{W}\vct{x}$, $\phi_{\text{frz}}(\vct{x})=\mtx{W}_\text{frz}\vct{x}$, $\phi_{\text{new}}(\vct{x}):=\phi_{\text{new}}^{\mtx{W}_\text{new}}(\vct{x})=\mtx{W}_\text{new}\vct{x}$ where $\mtx{W}=\mtx{W}_\text{frz}+\mtx{W}_\text{new}\in\mathbb{R}^{r\times d}$. The hypothesis set becomes ${\boldsymbol{\Phi}}_{\text{new}}=\{\phi_{\text{new}}^{\mtx{W}_\text{new}}{~\big |~} \mtx{W}_\text{new}\in{\cal{W}}\}$.

Here $\mtx{W}_\text{new}\in{\cal{W}}$ denotes the weights of the new feature representation to be learned on top of $\mtx{W}_\text{frz}$. Importantly, $\mtx{W}_\text{new}$ only learns the last $r_\text{new}$ rows since the first $r_\text{frz}$ rows are fixed by the frozen feature extractor $\mtx{W}_\text{frz}$. For the proof, we simply need to plug in the proper quantities within Theorem~\ref{cl thm}. First, observe that ${\cal{L}}^\star=\operatorname{\mathbb{E}}[Z^2]$: since $Z$ is independent zero-mean noise, for any predictor $\hat{y}:=\hat{y}(\vct{x})$ of the input features we have
\[
\operatorname{\mathbb{E}}[\ell(y,\hat{y})]=\operatorname{\mathbb{E}}[(y(\vct{x})+Z-\hat{y}(\vct{x}))^2]\geq \operatorname{\mathbb{E}}[Z^2],
\]
where $y(\vct{x})=\sigma({\vct{v}^\star_t}^\top \psi(\mtx{W}^\star\vct{x}))$ is the noiseless label.
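
\noindent The inequality above follows by expanding the square. Writing $\Delta(\vct{x})=y(\vct{x})-\hat{y}(\vct{x})$ (a shorthand introduced here), and using that $Z$ is zero-mean and independent of $\vct{x}$,
\begin{align}
\operatorname{\mathbb{E}}[(\Delta(\vct{x})+Z)^2]&=\operatorname{\mathbb{E}}[\Delta(\vct{x})^2]+2\operatorname{\mathbb{E}}[\Delta(\vct{x})]\operatorname{\mathbb{E}}[Z]+\operatorname{\mathbb{E}}[Z^2]\nonumber\\
&=\operatorname{\mathbb{E}}[\Delta(\vct{x})^2]+\operatorname{\mathbb{E}}[Z^2]\geq \operatorname{\mathbb{E}}[Z^2],\nonumber
\end{align}
with equality exactly when $\hat{y}(\vct{x})=y(\vct{x})$ almost surely.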

We next prove that ${\cal{L}}^\star_{\phi_{\text{frz}}}={\cal{L}}^\star=\operatorname{\mathbb{E}}[Z^2]$. This simply follows from the fact that the frozen representation $\mtx{W}_\text{frz}$ is perfectly compatible with the ground-truth representation $\mtx{W}^\star$. Specifically, observe that $\mtx{W}^\star_\text{new}:=\mtx{W}^\star-\mtx{W}_\text{frz}$ lies within the hypothesis set ${\cal{W}}$ since, by construction, $\|\mtx{W}^\star_\text{new}\|\leq \|\mtx{W}^\star\|\leq \bar{B}$ and $\mtx{W}^\star_\text{new}$ is zero in the first $r_\text{frz}$ rows. Similarly, $(\vct{v}^\star_t)_{t=1}^T$ obey the $\ell_2$ norm constraint $\tn{\vct{v}^\star_t}\leq B$. Thus, $\mtx{W}^\star_\text{new},{\mtx{V}}^\star=(\vct{v}^\star_t)_{t=1}^T$ are feasible solutions of the hypothesis space, and since $\mtx{W}^\star_\text{new}+\mtx{W}_\text{frz}=\mtx{W}^\star$, for this choice we have that $\hat{y}(\vct{x})=y(\vct{x})$. Thus the task-specific risks induced by $(\vct{v}^\star_t,\mtx{W}^\star)$ obey ${\cal{L}}_t(\vct{v}^\star_t,\mtx{W}^\star)=\operatorname{\mathbb{E}}[Z^2]$ for all $t\in[T]$. Consequently, the task-averaged risk obeys ${\cal{L}}({\mtx{V}}^\star,\mtx{W}^\star)=\operatorname{\mathbb{E}}[Z^2]$, proving the aforementioned claim.

The remaining task is bounding the covering dimensions of the hypothesis sets ${\mtx{H}}$ and ${\boldsymbol{\Phi}}_{\text{new}}$ and verifying Lipschitzness. The Lipschitzness of $h(\vct{a})=\sigma(\vct{v}^\top\psi(\vct{a}))\in{\mtx{H}}$, $\phi_{\text{frz}}(\vct{x})=\mtx{W}_\text{frz}\vct{x}$, $\phi_{\text{new}}(\vct{x})=\mtx{W}_\text{new}\vct{x}\in{\boldsymbol{\Phi}}_{\text{new}}$ follows from the fact that all $\mtx{W}_\text{new}\in{\cal{W}},\mtx{W}^\star,\mtx{W}_\text{frz}$ have spectral norms bounded by $\bar{B}$, and the fact that the Lipschitz constant of $h$ (denoted by $\lin{\cdot}$) can be bounded as $\lin{h}\leq \lin{\sigma} \lin{\psi}\tn{\vct{v}}\leq \bar{L}^2 B$ where $\bar{L}=\max(\lin{\psi},\lin{\sigma})$. Recall that the dependence on the Lipschitz constant is logarithmic. 

{What remains is determining the covering dimensions of ${\mtx{H}},{\boldsymbol{\Phi}}_{\text{new}}$, which simply follows from covering the parameter spaces of $\vct{v}\in\mathcal{B}^r(B),\mtx{W}_\text{new}\in{\cal{W}}$. Here $\mathcal{B}^r(B)$ is defined to be the Euclidean ball of radius $B$ in $\mathbb{R}^r$. Suppose $\mathcal{X}$ lies in a Euclidean ball of radius $R$ and let $\mathcal{F}=(\{\phi_{\text{frz}}\}+{\boldsymbol{\Phi}}_{\text{new}})\circ \mathcal{X}$ be the feature representations. Since $\mtx{W}_\text{frz}$ and $\mtx{W}_\text{new}$ have spectral norm at most $\bar{B}$, $\mathcal{F}$ is a subset of the Euclidean ball of radius $2\bar{B} R$.}

{Fix an $\varepsilon_0=\frac{\varepsilon}{2\bar{B} R\bar{L}^2}$ $\ell_2$-cover of $\mathcal{B}^r(B)$. This cover has cardinality at most $(\frac{6B\bar{B} R\bar{L}^2}{\varepsilon})^r$ and induces an $\varepsilon$-cover of ${\mtx{H}}$. To see this, given any $\vct{f}\in\mathcal{F}$ and $\vct{v}\in \mathcal{B}^r(B)$, there exists a cover element $\vct{v}'$ with $\tn{\vct{v}'-\vct{v}}\leq \varepsilon_0$; as a result, $|h_{\vct{v}'}(\vct{f})-h_{\vct{v}}(\vct{f})|\leq \bar{L}^2\tn{\vct{f}}\tn{\vct{v}'-\vct{v}}\leq \varepsilon$. Consequently $\cc{{\mtx{H}}}=r$. Similarly, since elements of ${\cal{W}}$ have $r_\text{new} d$ nonzero parameters (and recalling that ${\cal{W}}$ is also a subset of the spectral norm ball of radius $\bar{B}$), ${\cal{W}}$ admits a $\frac{\varepsilon}{R}$ Frobenius cover of cardinality $(\frac{3R\bar{B}\sqrt{r_\text{new}}}{\varepsilon})^{r_\text{new} d}$. Consequently, for any $\phi_{\mtx{W}}$ with $\mtx{W}=\mtx{W}_\text{new}+\mtx{W}_\text{frz}$, there exists a cover element $\phi_{\mtx{W}'}$ with $\|\mtx{W}'_{\text{new}}-\mtx{W}_\text{new}\|\leq \tf{\mtx{W}'_\text{new}-\mtx{W}_\text{new}}\leq \frac{\varepsilon}{R}$, such that for all $\vct{x}\in\mathcal{X}$, we have that $\tn{\phi_{\mtx{W}}(\vct{x})-\phi_{\mtx{W}'}(\vct{x})}\leq \|\mtx{W}'_{\text{new}}-\mtx{W}_\text{new}\|\tn{\vct{x}}\leq \varepsilon$. Consequently $\cc{{\boldsymbol{\Phi}}_{\text{new}}}\leq r_\text{new} d$. These bounds on covering dimensions conclude the proof after applying Theorem~\ref{cl thm}.}
\end{proof}
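
\noindent The counting in the proof above relies on the standard volumetric estimate for covering a Euclidean ball; we record it here as a sketch, with $k$ denoting a generic dimension and $\rho$ a generic radius (our notation). For $0<\varepsilon_0\leq \rho$, the ball of radius $\rho$ in $\mathbb{R}^k$ admits an $\varepsilon_0$-cover $\mathcal{N}$ with
\[
|\mathcal{N}|\leq \left(1+\frac{2\rho}{\varepsilon_0}\right)^k\leq \left(\frac{3\rho}{\varepsilon_0}\right)^k
\quad\Longrightarrow\quad
\log|\mathcal{N}|\leq k\log\left(\frac{3\rho}{\varepsilon_0}\right).
\]
Substituting $\rho=B$ and $\varepsilon_0=\frac{\varepsilon}{2\bar{B}R\bar{L}^2}$ gives $\frac{3\rho}{\varepsilon_0}=\frac{6B\bar{B}R\bar{L}^2}{\varepsilon}$, which is the base of the cardinality bound used for the classifier heads. Similarly, $\rho=\sqrt{r_\text{new}}\bar{B}$ and $\varepsilon_0=\frac{\varepsilon}{R}$ give $\frac{3R\bar{B}\sqrt{r_\text{new}}}{\varepsilon}$ for the Frobenius cover of ${\cal{W}}$. In both cases the exponent is the number of free parameters, which is what the covering dimension records up to the logarithmic factor.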











\section{Theoretical Analysis of Adding $T$ New Tasks Sequentially (Theorem \ref{seq thm})}\label{app C}



Theorem \ref{cl thm} adds $T$ tasks simultaneously on frozen feature extractor $\phi_{\text{frz}}$. Below, we consider the setting where we add these tasks sequentially via repeated applications of Theorem \ref{cl thm} and a new task $t$ builds upon the cumulative representation learned from tasks $1$ to $t-1$. This setting better reflects what actually happens in continual learning and in our experiments but is more involved because representation quality of an earlier task will impact the accuracy of the future tasks. To this end, we first describe the learning setting and assumptions on the representation mismatch.

\noindent\textbf{Sequential learning setting:} We will learn a new task with index $t$ over the hypothesis set ${\boldsymbol{\Phi}}_{\text{new}}^t$ for $t\in[T]$. Suppose we are at task $t$. That is, we assume that we have already built incremental (continual) feature-extractors $\phi_{\text{new}}^1,\dots,\phi_{\text{new}}^{t-1}$ for tasks $1$ through $t-1$ where each one obeys $\phi_{\text{new}}^\tau\in{\boldsymbol{\Phi}}_{\text{new}}^\tau$. Thus, the (cumulative) frozen representation at time $t$ is given by
\[
\phi_{\text{frz}}^t=\phi_{\text{frz}}+\sum_{i=1}^{t-1} \phi_{\text{new}}^i.
\]
Here $\phi_{\text{frz}}:=\phi_{\text{frz}}^1$ is the representation built before any new task arrived. Using $\phi_{\text{frz}}^t$, we solve the following (essentially identical) variation of \eqref{crl} where \textbf{we focus on task $t$ given the outcome of the continual learning procedure until task $t-1$.}
\begin{align}
h^t,&\phi_{\text{new}}^t=\underset{\phi=\phi_{\text{new}}+{\phi}_{\text{frz}}^t}{\underset{h\in{\mtx{H}},\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}^t}{\arg\min}} {\widehat{\cal{L}}}_{\mathcal{S}_t}(f)=\frac{1}{N}\sum_{i=1}^N \ell(y_{ti},f(\vct{x}_{ti}))\quad\text{where}\quad f=h\circ\phi. \tag{CRL-SEQ}\label{crlseq}
\end{align}
After obtaining $\phi_{\text{new}}^t$, the feature-extractor of task $t$ is given by $\phi^t=\phi_{\text{frz}}^t+\phi_{\text{new}}^t$ and the prediction function becomes $f^t=h^t\circ \phi^t$. Finally, $\phi^t$ of task $t$ becomes the next frozen feature-extractor, i.e.~$\phi_{\text{frz}}^{t+1}=\phi^t$.

In this sequential setting, intuitively $({\boldsymbol{\Phi}}_{\text{new}}^\tau)_{\tau=1}^T$ are less complex hypothesis spaces compared to ${\boldsymbol{\Phi}}_{\text{new}}$ of Theorem \ref{cl thm}. This is because we learn ${\boldsymbol{\Phi}}_{\text{new}}^\tau$ using a single task. In that sense, the proper scaling of hypothesis set complexity is $\cc{{\boldsymbol{\Phi}}_{\text{new}}^\tau}\propto \cc{{\boldsymbol{\Phi}}_{\text{new}}}/T$ for $\tau\in[T]$. Specifically, we assume that for some global value $\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}>0$
\begin{align}
    \cc{{\boldsymbol{\Phi}}_{\text{new}}^\tau}\leq \mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}\quad \text{for all}\quad 1\leq\tau\leq T.\label{ccn decay}
\end{align}
Secondly, compared to Theorem \ref{cl thm}, we need to introduce a more intricate compatibility condition to assess the benefit of the representations learned from finite data $\phi_{\text{new}}^1,\dots,\phi_{\text{new}}^{t-1}$ for the new task $t$. This will be accomplished by first introducing population level compatibility and then introducing an assumption that controls the impact of finite sample learning on the new task. This definition and assumption arise naturally to control the learnability of a new task given features of earlier tasks. Related assumptions (e.g.~the \emph{task diversity} condition) have been used by other works for transfer/meta learning purposes \cite{tripuraneni2020theory,oymak2021generalization,du2020few,xu2021representation}.

Let $(h^{\star,1},\phi_{\text{new}}^{\star,1}),\dots,(h^{\star,t},\phi_{\text{new}}^{\star,t}),\dots$ be the (classifier, representation) sequence obtained by solving \eqref{crlseq} using infinite samples $N=\infty$ (that is, solving the population-level optimization rather than finite-sample ERM). The following definition introduces the representation mismatch at task $t$ to capture the suitability of the population-level representations $(\phi_{\text{new}}^{\star,\tau})_{\tau=1}^{t-1}$ for a new task $t$. Set $\phi_{\text{frz}}^{\star,t}=\phi_{\text{frz}}+\sum_{\tau=1}^{t-1}\phi_{\text{new}}^{\star,\tau}$ and define the set of all feasible representations for task $t$ as ${\boldsymbol{\Phi}}^t=\sum_{\tau=1}^t{\boldsymbol{\Phi}}_{\text{new}}^\tau+{\boldsymbol{\Phi}}_{\text{frz}}$\red{$\subseteq{\boldsymbol{\Phi}}$}.
\begin{definition}[Population quantities and representation mismatch]\label{def pop seq}For task $t$, define the population (infinite-sample) risk as ${\cal{L}}_t(h,\phi)=\operatorname{\mathbb{E}}[{\widehat{\cal{L}}}_{\mathcal{S}_t}(h,\phi)]$. Define the optimal risk of task $t$ over all feasible representations in \red{${\boldsymbol{\Phi}}$} as $\Lci{t}=$ $\min_{h\in{\mtx{H}},\phi\in\red{{\boldsymbol{\Phi}}}}{\cal{L}}_t(h,\phi)$. Note that the optimal risk gets to choose the best representations within $({\boldsymbol{\Phi}}_{\text{new}}^\tau)_{\tau=1}^t$ and ${\boldsymbol{\Phi}}_{\text{frz}}$ \red{since $\sum_{\tau=1}^t{\boldsymbol{\Phi}}_{\text{new}}^\tau+{\boldsymbol{\Phi}}_{\text{frz}}\subseteq {\boldsymbol{\Phi}}$.} Finally, define the optimal risk with fixed frozen model $\phi_{\text{frz}}^{\star,t}=\phi_{\text{frz}}+\sum_{\tau=1}^{t-1}\phi_{\text{new}}^{\star,\tau}$ to be $\Lci{t}_\text{seq}=\min_{h\in{\mtx{H}},\phi_{\text{new}}^t\in{\boldsymbol{\Phi}}_{\text{new}}^t}{\cal{L}}_t(h,\phi)$ s.t.~$\phi=\phi_{\text{new}}^t+\phi_{\text{frz}}^{\star,t}$. The sequential representation mismatch at task $t$ is defined as
\begin{align}
\MS{t}=\Lci{t}_\text{seq}-\Lci{t}.\label{mst}
\end{align}
\end{definition}
This definition is the sequential counterpart of Definitions \ref{def pop} and \ref{def MM}. It quantifies the cost of continual learning with respect to choosing the best (oracle) representation. It also aims to capture the properties of the task distributions; thus, it uses infinite samples for tasks $1\leq \tau\leq t$. In practice, a new task $t$ is learned on top of finite sample tasks. We need to make a plausible assumption to formalize the following intuition:
\begin{quote}
\emph{With enough samples, representations learned from finite sample tasks are almost as useful as representations learned from infinite sample tasks.}
\end{quote}
We accomplish this by introducing empirical/population compatibility below. The basic idea is that the quality of the representation should decay gracefully as we move from infinite to finite samples.
\begin{assumption} [Empirical/population compatibility] \label{fin comp} For task $t$, define the population risk ${\cal{L}}_t(h,\phi)=\operatorname{\mathbb{E}}[{\widehat{\cal{L}}}_{\mathcal{S}_t}(h,\phi)]$. Recall the definitions of $\phi_{\text{frz}},(h^{\star,t},\phi_{\text{new}}^{\star,t})_{t\geq 1}$ from Def.~\ref{def pop seq}. Given a sequence of incremental feature-extractors ${\phi}=(\phi_{\text{new}}^\tau)_{\tau\geq 1}$, recall from \eqref{crlseq} that task $\tau-1$ uses the extractor $\phi_{\text{frz}}^{\tau}=\phi_{\text{frz}}+\sum_{i=1}^{\tau-1}\phi_{\text{new}}^{i}$. \emph{To quantify representation quality}, we introduce the risks ${\bar{\cal{L}}}^t_\text{seq}:={\bar{\cal{L}}}^{t,{\phi}}_\text{seq},{\cal{L}}^t_\text{seq}:={\cal{L}}^{t,{\phi}}_\text{seq}$ induced by ${\phi}$ (similar to Def.~\ref{def pop seq})
\begin{align}
&{\bar{\cal{L}}}^t_\text{seq}=\min_{h\in{\mtx{H}}}{\cal{L}}_t(h,\phi_{\text{frz}}^{t+1})\nonumber\\
&{\cal{L}}^t_\text{seq}=\min_{h\in{\mtx{H}},\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}^t}{\cal{L}}_t(h,\phi)~\text{s.t.}~\phi=\phi_{\text{new}}+\phi_{\text{frz}}^t.\label{ltseq}
\end{align}
Here ${\bar{\cal{L}}}^t_\text{seq}$ uses the given $\phi_{\text{new}}^t$ (within ${\phi}$) whereas ${\cal{L}}^t_\text{seq}$ chooses the optimal $\phi_{\text{new}}^t\in {\boldsymbol{\Phi}}_{\text{new}}^t$\footnote{${\cal{L}}^t_\text{seq}$ definition is needed to quantify the representation quality of a new task where incremental update $\phi_{\text{new}}^t$ has not been built yet. In contrast, ${\bar{\cal{L}}}^t_\text{seq}$ quantifies the representation quality of previous tasks for which (full) feature extractors are known.}. Thus ${\bar{\cal{L}}}^t_\text{seq}\geq {\cal{L}}^t_\text{seq}$. Based on these, define the mismatch between empirical and population-level optimizations
\begin{align}\nonumber
\ME{t}={\cal{L}}^t_\text{seq}-\Lci{t}_\text{seq}\quad\text{and}\quad\MA{t}={\bar{\cal{L}}}^t_\text{seq}-\Lci{t}_\text{seq}.
\end{align}
Again by construction {$\MA{t}\geq \ME{t}$}. We say \textbf{empirical and population representations} are compatible if there exists a constant $\eps_0>0$ such that, for all choices of $(\phi_{\text{new}}^\tau)_{\tau=1}^t\in {\boldsymbol{\Phi}}_{\text{new}}^1\times \dots\times{\boldsymbol{\Phi}}_{\text{new}}^t$, we have that
\[
\underbrace{\ME{t}}_{\text{subopt on new task}}\leq\underbrace{\eps_0}_{\text{additive mismatch}}+\underbrace{\frac{1}{t-1} \sum_{\tau=1}^{t-1} \MA{\tau}}_{\text{avg subopt on first $t-1$ tasks}}.
\]
\end{assumption}
\textbf{Interpretation:} Here, $\frac{1}{t-1} \sum_{\tau=1}^{t-1} \MA{\tau}$ quantifies the suboptimality of the representations used by the first $t-1$ tasks. Recall that task $\tau$ uses representation $\phi_{\text{frz}}^{\tau+1}$ for $\tau\leq t-1$. In words, this assumption guarantees that task $t$ can find an (incremental) representation $\phi_{\text{new}}^t\in{\boldsymbol{\Phi}}_{\text{new}}^t$ and classifier $h\in{\mtx{H}}$ such that its suboptimality to the population-optimal risk $\Lci{t}_\text{seq}$ is upper bounded in terms of the average of the suboptimalities over the first $t-1$ tasks. Note that this can also be viewed as a \textbf{sequential task diversity} condition because we are assuming that good quality representations on the first $t-1$ tasks (w.r.t.~population minima) ensure a small excess risk (w.r.t.~population minima) on the new task.


Following this assumption, the theorem below is our main guarantee on sequential CRL. It is obtained by stitching together $T$ applications of Lemma \ref{lem seq}, which adds a single new task. The statement of Lemma \ref{lem seq} is provided after the proof of Theorem \ref{seq thm} below.
\begin{theorem}\label{seq thm} Suppose we solve the sequential continual learning problem \eqref{crlseq} for each task $1\leq t\leq T$ to obtain hypotheses $(h^t,\phi_{\text{new}}^t)_{t=1}^T$. The $t$'th model uses the prediction $h^t\circ \phi_{\text{frz}}^{t+1}$ where $\phi_{\text{frz}}^t=\phi_{\text{frz}}+\sum_{\tau=1}^{t-1}\phi_{\text{new}}^\tau$. Consider the same core setting as in Theorem \ref{cl thm}: namely, we assume Lipschitz hypothesis sets ${\boldsymbol{\Phi}}_{\text{new}}^t,{\mtx{H}}$, Lipschitz loss function $\ell:\mathbb{R}\times \mathbb{R}\rightarrow[0,1]$ and bounded input feature set $\mathcal{X}$, all with respect to Euclidean distance. Recall that $\Lci{t}$ is the optimal risk for task $t$. Suppose the complexity of each ${\boldsymbol{\Phi}}_{\text{new}}^t$ is upper bounded by $\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}$ as in \eqref{ccn decay} for $t\in [T]$. For some absolute constant $C>0$, with probability at least $1-2Te^{-\tau}$, the solutions $(h^t,\phi_{\text{new}}^t)_{t=1}^T$ of \eqref{crlseq} satisfy the following cumulative generalization bound (summed over all $T$ tasks)
\begin{align}
\underbrace{\sum_{t=1}^T\left({\cal{L}}_t(h^t,\phi^t)-\Lci{t}\right)}_{\text{excess test risk w.r.t.~oracle}}&\leq \underbrace{\sum_{t=1}^T\MS{t}}_{\text{sequential representation mismatch}}+\underbrace{T^2(\eps_0+C\sqrt{\frac{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}{N}})}_{\text{cost of finite sample learning}}.\label{gen bound 3}
\end{align}
In light of Def.~\ref{def pop seq}, we can write the suboptimality with respect to solving sequential problems with $N=\infty$ as
\begin{align}
\underbrace{\sum_{t=1}^T\left({\cal{L}}_t(h^t,\phi^t)-\Lci{t}_\text{seq}\right)}_{\text{excess test risk w.r.t.~sequential learning}}&\leq \underbrace{T^2(\eps_0+C\sqrt{\frac{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}{N}})}_{\text{cost of finite sample learning}}.\label{gen bound 4}
\end{align}
\end{theorem}
\noindent \textbf{Discussion.} Before providing the proof in the next section, a few remarks are in order. First, we state the sum of test errors rather than the average. Secondly, observe that \eqref{gen bound 3} compares the test errors to the optimal possible errors $\Lci{t}$. On the right hand side there are two terms: ``representation mismatch'' and ``cost of finite sample learning''.

``Representation mismatch'' quantifies the population-level error that arises even if each task had access to infinite samples. This is because, even if each task solved \eqref{crlseq} perfectly, the resulting sequence of representations does not have to be optimal for the next task and $\MS{t}$ precisely captures this suboptimality. Recall that this population-level gap arises from Definition \ref{def pop seq}. \red{This also emphasizes that we should train diverse tasks first, since diverse features learned from previous tasks help reduce $\MS{t}$ by providing highly relevant representations. }

The ``cost of finite sample learning'' term captures the finite sample effects, and it is proportional to the statistical error rate of solving \eqref{crlseq} for the first task only, i.e.~$\sqrt{\frac{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}}{N}}$. Here, recall from \eqref{ccn decay} that $\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}$ is an upper bound on the complexities of ${\boldsymbol{\Phi}}_{\text{new}}^1,\dots,{\boldsymbol{\Phi}}_{\text{new}}^T$. A perhaps unexpected dependence is the quadratic growth in $T$. This is in contrast to the linear growth one would get from adding tasks simultaneously as in Theorem \ref{cl thm}. This quadratic growth arises from the accumulation of the finite-sample representation suboptimalities as we add more tasks. Specifically, Assumption \ref{fin comp} helps guarantee that feature-extractors of tasks $1,\dots,t-1$ are useful for task $t$; however, as more tasks are added they incur more divergence from $(\phi^{\star,t})_{t=1}^T$. Each new task has a finite sample and contributes to this divergence, and our analysis leads to an $\order{T^2}$ upper bound on the error. The term $\eps_0$ is an additional mismatch term that makes Assumption \ref{fin comp} significantly more flexible (albeit ideally, it is close to zero). Finally, it would be interesting to explore the tightness of these bounds for concrete analytical settings (e.g.~\eqref{crlseq} with linear models or neural nets), run more experiments, and further study the role of finite sample effects and representation divergences.
\subsection{Proof of Theorem \ref{seq thm}}
\begin{proof} Applying Lemma \ref{lem seq} for each task $1\leq t\leq T$ and union bounding, with probability $1-2Te^{-\tau}$, for all $T$ applications of \eqref{crlseq}, we have that
\begin{align}
&{\cal{L}}(h^t,\phi_{\text{new}}^t+\phi_{\text{frz}}^t)-\Lci{t}\leq \MS{t}+\underbrace{\ME{t}+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}}{N}}}_{\text{excess empirical error}}\label{gen bound}\\
&\MA{t}\leq \sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}}{N}}+\ME{t}.\nonumber
\end{align}
We simply need to control $\ME{t}$ at time $t$. Following Assumption \ref{fin comp}, observe that at $t=1$ we simply have $\ME{1}=0$. For $t\geq 2$, we have that
\begin{align}
&\ME{t}\leq \eps_0+\frac{1}{t-1} \sum_{\tau=1}^{t-1} \MA{\tau}\nonumber\\
&\implies \MA{t}\leq \left[\eps_0+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}}{N}}\right]+\frac{\sum_{\tau=1}^{t-1}\MA{\tau}}{t-1}.\label{required eq}
\end{align}
Declare $B:=\eps_0+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}}{N}}$. We will inductively prove that for all $t$
\begin{align}
\MA{t}\leq t B.\label{induct}
\end{align}
For $t=1$, the claim follows from the second line of \eqref{gen bound} together with $\ME{1}=0$, which give $\MA{1}\leq B$. Suppose the claim holds until time $t-1$ for some $t\geq 2$. Using \eqref{required eq}, this implies that
\[
\MA{t}\leq B+\frac{\sum_{\tau=1}^{t-1}\MA{\tau}}{t-1}\leq B+  \frac{\sum_{\tau=1}^{t-1}\tau}{t-1}B=B+\frac{t}{2}B\leq tB.
\]
The last inequality holds since $t\geq 2$. Thus the claim holds for $t$ as well.

To proceed, we bound the excess empirical error in \eqref{gen bound}. For $t=1$, it is at most $B$ since $\ME{1}=0$. For $t\geq 2$, Assumption \ref{fin comp} and the upper bound \eqref{induct} give $\ME{t}\leq \eps_0+\frac{1}{t-1}\sum_{\tau=1}^{t-1}\tau B=\eps_0+\frac{t}{2}B$; adding the square-root term, the excess error is at most $B+\frac{t}{2}B\leq tB$. In both cases the excess error in \eqref{gen bound} is upper bounded by $tB$, and we obtain
\begin{align}
{\cal{L}}(h^t,\phi_{\text{new}}^t+\phi_{\text{frz}}^t)-\Lci{t}\leq \MS{t}+t (\eps_0+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\mathcal{C}^{{\boldsymbol{\Phi}}}_{\text{new}}+\tau}}{N}}).\label{bound fin gen}
\end{align}
 To conclude, we simply sum up \eqref{bound fin gen} for $1\leq t\leq T$ to obtain the advertised bound where the total excess finite-sample error grows as $\frac{T(T+1)}{2}B\leq T^2B$. This yields \eqref{gen bound 3}. \eqref{gen bound 4} is identical to \eqref{gen bound 3} via \eqref{mst}.
\end{proof}


Following Def.~\ref{def pop seq} and Assumption \ref{fin comp}, the lemma below probabilistically quantifies the generalization risk when we add one task. Theorem \ref{seq thm}, which quantifies the generalization risk when adding $T$ tasks, follows from $T$ applications of this lemma.
\begin{lemma}\label{lem seq} Suppose we are given the output pairs $(h^\tau,\phi_{\text{new}}^\tau)_{\tau=1}^{t-1}$ of the first $t-1$ applications of the sequential CRL problem \eqref{crlseq}. Now, we solve for the $t$'th solution denoted by the pair $h^t,\phi_{\text{new}}^t$. Under the same conditions as in Theorem \ref{cl thm} (Lipschitz hypothesis sets ${\boldsymbol{\Phi}}_{\text{new}}^t,{\mtx{H}}$, Lipschitz loss $\ell:\mathbb{R}\times \mathbb{R}\rightarrow [0,1]$ and bounded input features $\mathcal{X}$), for some absolute constant $C>0$, with probability $1-2e^{-\tau}$, the solution $(h^t,\phi_{\text{new}}^t)$ of \eqref{crlseq} satisfies the following two properties
\begin{align}\label{eq mm}
&{\cal{L}}(h^t,\phi_{\text{new}}^t+\phi_{\text{frz}}^t)-\Lci{t}\leq \MS{t}+\ME{t}+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}^t}+\tau}}{N}}\\
&\MA{t}\leq \sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}^t}+\tau}}{N}}+\ME{t}.\nonumber
\end{align}
Here, $\MS{t},\ME{t},\MA{t}$ are mismatch definitions introduced in Def.~\ref{def pop seq} and Assumption \ref{fin comp}.
\end{lemma}
\begin{proof}
As introduced in \eqref{ltseq}, the representation quality of the new task will be captured by
\[
{\cal{L}}^t_\text{seq}=\min_{h\in{\mtx{H}},\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}^t}{\cal{L}}_t(h,\phi)~\text{s.t.}~\phi=\phi_{\text{new}}+\phi_{\text{frz}}^t.
\]
Similarly, recall the empirical mismatch $\ME{t}={\cal{L}}^t_\text{seq}-\Lci{t}_\text{seq}$. 
Based on this and recalling the definition of $\MS{t}$ in \eqref{mst}, applying Theorem \ref{cl thm} with $T=1$, we obtain that, with probability $1-2e^{-\tau}$, we have the following bounds (on the same event)
\begin{align}
{\cal{L}}(h^t,\phi_{\text{new}}^t+\phi_{\text{frz}}^t)&\leq{\cal{L}}^t_\text{seq}+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}^t}+\tau}}{N}}\nonumber\\
{\cal{L}}(h^t,\phi_{\text{new}}^t+\phi_{\text{frz}}^t)&\leq \Lci{t}+\MS{t}+\ME{t}+\sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}^t}+\tau}}{N}}.\nonumber
\end{align}
The second inequality establishes the first line of \eqref{eq mm}.
The remaining challenge is simply relating the population and empirical mismatches i.e.~controlling $\ME{t}$. Recall $h^t$ is the ERM solution of \eqref{crlseq} associated with $\phi_{\text{new}}^t$. Using the uniform concentration event (implied within the application of Theorem \ref{cl thm} via \eqref{unif conv}), we have
\[
|{\widehat{\cal{L}}}_{\mathcal{S}_t}(h,\phi_{\text{new}}+\phi_{\text{frz}}^t)-{\cal{L}}_t(h,\phi_{\text{new}}+\phi_{\text{frz}}^t)|\leq \sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}^t}+\tau}}{N}}.
\]
Now, using the optimality of $h^t$ for $\phi_{\text{new}}^t$ and the fact that $(h^t,\phi_{\text{new}}^t)$ minimizes the empirical risk, observe that
\begin{align*}
\MA{t}&=\min_{h\in{\mtx{H}}}{\cal{L}}(h,\phi_{\text{frz}}^{t+1})-\Lci{t}_\text{seq}\\
&\leq [\min_{h\in{\mtx{H}}}{\cal{L}}(h,\phi_{\text{frz}}^{t+1})-{\cal{L}}^t_\text{seq}]+\ME{t}\\
&\leq [{\cal{L}}(h^t,\phi_{\text{frz}}^{t+1})-{\cal{L}}^t_\text{seq}]+\ME{t}\\
&\leq \sqrt{\frac{\ordet{\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}^t}+\tau}}{N}}+\ME{t},
\end{align*}
to conclude with the second line of \eqref{eq mm}.
\end{proof}


\section{Introduction}





\emph{Continual learning} (CL) or lifelong learning aims to build a model for a non-stationary and never-ending sequence of tasks, without access to previous or future data \cite{thrun1998lifelong,chen2018lifelong,parisi2019continual}. The main challenge in CL is that a standard learning procedure usually results in performance reduction on previously trained tasks if their data are not available. This phenomenon is termed \emph{catastrophic forgetting} \cite{mccloskey1989catastrophic,kirkpatrick2017overcoming,pfulb2019comprehensive}.

The importance of continual learning in real-life inference and decision-making scenarios has led to a rich set of CL techniques~\cite{delange2021continual,van2019three}, including replay-based, regularization-based, and architecture-based strategies. However, the theoretical principles of continual learning are less well understood, and theoretical progress lags behind the algorithmic advances despite recent results {(see \cite{bennani2020generalisation,doan2021theoretical,lee2021continual})}. In this work, for the first time, we investigate the problem of \emph{Continual Representation Learning (CRL)} to answer

\begin{center}
\fbox{\centering\begin{minipage}{0.84\textwidth}
\centering\textit{
What are the statistical benefits of previous feature representations for learning a new task? Can we build an insightful theory explaining empirical performance?}
\end{minipage}}
\end{center}


Our key contribution is addressing both questions affirmatively. We develop our algorithms and theory for architecture-based zero-forgetting CL which includes  PackNet~\cite{mallya2018packnet}, CPG~\cite{hung2019compacting}, and RMN~\cite{wu2020understanding}. These methods eliminate forgetting by training a sub-network for each task and freezing the trained parameters. At a high level, all of these methods inherently have the potential of continual representation learning by allowing new tasks to reuse the frozen feature representations built for earlier tasks. However, quantifying the empirical benefits/performance of these representations and building the associated theory have been elusive. We overcome this via innovations in experiment design, theory, and algorithms:







\noindent$\bullet$ \textbf{Theoretical and empirical benefits of continual representations.} We establish theoretical results and sample complexity bounds for CRL by using tools from empirical process theory. Within our model, a new task uses frozen feature map $\phi_{\text{frz}}$ of previous tasks and learns an additional task-specific representation $\phi_{\text{new}}$. For PackNet \cite{mallya2018packnet}, $\phi_{\text{frz}}$ corresponds to all the nonzero weights so far and $\phi_{\text{new}}$ is the nonzeros allocated to the new task, thus $\phi_{\text{frz}}$ requires a lot more data to learn.  Our theory (see Section~\ref{sec:crl_theory}) explains (1) how the new task can reuse the frozen feature map to greatly reduce the sample size and (2) how to quantify the representational compatibility between old and new tasks. Specifically, we fully avoid the statistical cost of learning $\phi_{\text{frz}}$ and replace it with a \emph{``representational mismatch''} term between the new task and frozen features. We also extend our results from a single task to learning a sequence of tasks by quantifying the aggregate impact of finite data on the representation $\phi_{\text{frz}}$, which evolves as new tasks arrive (deferred to the appendix).

An important conclusion is that, ideally, the frozen representation $\phi_{\text{frz}}$ should contain \textbf{diverse and high-quality features} so that it has small \emph{mismatch} with new tasks. This is consistent with the results on transfer learning \cite{tripuraneni2020theory} and motivates the following CL principle supported by our experiments:

\begin{center}
\hspace{-10pt}\fbox{\begin{minipage}{0.84\linewidth}
\centering\textit{
First learn diverse and large-data tasks so that their representations help upcoming tasks.}
\end{minipage}}
\end{center}


 Indeed, we show in Fig.~\ref{fig:LS} that a new task with small data achieves significantly higher accuracy if it is added later in the task sequence (so that $\phi_{\text{frz}}$ becomes more diverse). Then, we show in Fig.~\ref{fig:diversity} that choosing the first task to be diverse (such as ImageNet) helps all the downstream tasks. Finally, we show in Fig.~\ref{fig:Sample} that it is better to first learn tasks with large sample sizes to ensure high-quality features. 
\red{Our results on the importance of task order also relate to curriculum learning \cite{bengio2009curriculum} where the agent gets to choose the order tasks are learned. However, instead of curriculum learning, which aims to learn one task from easy to hard, our work is based on a more general continual learning setup. Our conclusion on task diversity provides a new perspective on the learning order of curriculum learning that we should first learn more representative tasks instead of an easier task. }











\noindent$\bullet$ \textbf{Algorithms for inference-efficient CRL.} Zero-forgetting CL methods often need a large neural network (dubbed the supernetwork) to load numerous tasks into subnetworks, or dynamically expand the model to avoid forgetting \cite{delange2021continual}. Thus, CL/CRL may incur a large computational cost during inference. This leads us to ask whether one can retain the accuracy benefits of CRL while ensuring that each new task utilizes an \textbf{inference-efficient} sub-network, thus achieving the best of both worlds. We quantify inference-efficiency via the floating point operations (FLOPs) required to compute the task subnetwork. To this end, we propose the Efficient Sparse PackNet (ESPN) algorithm (Fig.~\ref{overview fig}). In a nutshell, ESPN guarantees inference-efficiency by incorporating a channel-pruning stage within PackNet-style approaches. Via extensive evaluations, we find that ESPN incurs minimal loss of accuracy while greatly reducing FLOPs (as much as 80\% in our SplitCIFAR-100 experiments, Table \ref{CIFARtable}).












{In summary, this work makes key contributions to continual representation learning from empirical, theoretical, and algorithmic perspectives. In the remainder of the paper, we discuss related work, detail our empirical and theoretical findings on CRL, and present the ESPN algorithm and its evaluations.}
















\section{Efficient Sparse PackNet (ESPN) Algorithm}\label{sec: espnalgo}
 We will use PackNet and ESPN throughout the paper to study continual representation learning. Thus, we first introduce the high-level idea of our ESPN algorithm which augments PackNet. Suppose we have a single model referred to as supernetwork and a sequence of tasks. Our goal is to train and find optimal sparse sub-networks within the supernet that satisfy both FLOPs and sparsity restrictions without any performance reduction or forgetting of earlier tasks. Figure~\ref{overview fig} illustrates our proposed algorithm. We propose a joint channel and weight pruning strategy in which the FLOPs constraint (in channel pruning) is important for efficient inference and the sparsity constraint (in weight pruning) is important for \emph{packing} all tasks into the network even for a large number of tasks $T$. In essence, ESPN equips PackNet-type methods with inference-efficiency using an innovative FLOP-aware channel pruning strategy. Section \ref{sec: espn_detail} and the appendix provide implementation details and inference-time evaluations of ESPN.

\noindent{\textbf{Notation.}} ESPN admits a FLOPs constraint as an input parameter, which is a critical aspect of the experimental evaluations. Let MAX\_FLOPs be the FLOPs required for one forward propagation through the dense supernetwork. In our evaluations, for $\gamma\in (0,1]$, we use \textbf{ESPN-$\gamma$} to denote our CL Algorithm \ref{algo 1} in which each task obeys the FLOPs constraint $\gamma\times \text{MAX\_FLOPs}$. Similarly, \textbf{Individual-$\gamma$} will be the baseline in which each task is trained individually (from scratch) on the full supernetwork while using at most $\gamma\times \text{MAX\_FLOPs}$. \textbf{Individual} is the same as \textbf{Individual}-1, where we train the whole network without pruning.



 

\section{Conclusion and Discussion}

To summarize, our work elucidates the benefit of continual representation learning theoretically and empirically and sheds light on the role of task ordering, diversity and sample size. We also propose a new CL algorithm to learn good representations subject to inference considerations. Extensive experimental evaluations on the proposed Efficient Sparse PackNet demonstrate its ability to achieve good accuracy as well as fast inference.


\noindent \textbf{Limitations and future directions.} 
Although we highlight the importance of task order in CRL, the task sequence is not always under our control. It would be desirable to develop adaptive learning schemes that can better identify and exploit diverse tasks and discover semantic connections across task pairs even for a predetermined task sequence.
Another potential direction is to develop similar inference-efficient continual learning schemes for other architectures by appropriately adapting our joint weight and channel pruning strategy. An example is transformer-based models where computation and memory efficiency is particularly critical.

\section*{Acknowledgements}\vspace{-10pt}
This work is supported in part by the National Science Foundation under grants CCF-2046816, CCF-2046293, CNS-1932254, and by the Army Research Office under grant W911NF-21-1-0312.

















\section{Inference-efficient continual representations via Efficient Sparse PackNet (ESPN)} \label{sec: espn_detail}


In this section, we provide an in-depth discussion of our ESPN algorithm and evaluate its benefits for inference efficiency.
ESPN (Algorithm~\ref{algo 1}) learns the new sub-network in three phases: pre-training (over all free weights), gradual pruning (to prune weights and channels), and fine-tuning (to refine the new allocated weights). While not shown in Algorithm~\ref{algo 1}, we also introduce the following innovations.

\input{sec/algo_oc}

\noindent$\bullet$ \textbf{FLOP-aware pruning.} 
We propose a FLOP-aware pruning strategy that provides up to 80\% FLOPs reduction in our experiments. In practice, Algorithm~\ref{algo 1} minimizes a regularized objective ${\cal{L}}_{\mathcal{S}_t}({\boldsymbol{\theta}}\odot\vct{m})+\mathcal{R}({\boldsymbol{\Gamma}},\vct{m})$, with
\begin{align}
	\label{eq:reg}\mathcal{R}({\boldsymbol{\Gamma}},\vct{m})&=\sum_{l\in [L]}\lambda_l\|{\boldsymbol{\Gamma}}_l\|_1,~~\lambda_l=g(\text{FLOP}_l(\vct{m})).
\end{align}
${\boldsymbol{\Gamma}}_l$ denotes the channel scores of the $l^{\text{th}}$ layer, $L$ is the number of layers, and $g(\cdot)$ is a monotonically increasing function. In this manner, channels costing more FLOPs are assigned larger $\lambda_l$, pushed towards zero, and subsequently pruned. Additionally, $\text{FLOP}_l(\cdot)$ computes the FLOPs of the $l^{\text{th}}$ layer. Since we use gradual pruning (Line~\ref{algo:pruning} in Algorithm~\ref{algo 1}), FLOPs change over time and $\lambda_l$ is automatically tuned. In our experiments, we use $g(x)=C\sqrt{x}$ for a proper scaling choice $C>0$. The evaluation of our FLOP-aware channel pruning algorithm is presented in Section~\ref{pruning sec} of the appendix.
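
\noindent As a purely illustrative calculation (the FLOP counts below are hypothetical and are not taken from our experiments), suppose that a two-layer mask $\vct{m}$ yields $\text{FLOP}_1(\vct{m})=4\times 10^6$ and $\text{FLOP}_2(\vct{m})=10^6$. With $g(x)=C\sqrt{x}$, the layer-wise penalties in \eqref{eq:reg} become
\begin{align*}
\lambda_1=C\sqrt{4\times 10^6}=2000\,C,\qquad \lambda_2=C\sqrt{10^6}=1000\,C,\qquad \frac{\lambda_1}{\lambda_2}=2.
\end{align*}
Thus a layer that costs four times the FLOPs receives only twice the $\ell_1$ penalty on its channel scores, so the square root tempers how strongly the regularizer favors pruning expensive layers. If gradual pruning later halves $\text{FLOP}_1(\vct{m})$, then $\lambda_1$ shrinks by a factor of $\sqrt{2}$ without any manual retuning.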





\noindent$\bullet$ \textbf{Weight allocation.} 
{Inspired by our theoretical finding that CL improves data efficiency,} we introduce a simple weight allocation scheme to assign free weights to a new task. Let $p$ be the total number of weights and $p_{t-1}$ be the number of used weights after training tasks $1$ to $t-1$. Then we assign $\lceil(p-p_{t-1})\cdot\alpha\rceil$ free weights to task $t$, where $0<\alpha<1$ is {the \emph{weight-allocation} level}. We emphasize that weight allocation controls the number of new nonzeros allocated to a task. A new task is allowed to use all of the (frozen) nonzeros that are allocated to the previous tasks (as long as the FLOPs constraint is not violated). We evaluate the empirical performance of our weight allocation technique in Section~\ref{sec:wa} in the appendix.
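
\noindent To see how this rule distributes capacity, consider the following sketch, which assumes (for illustration only) that the supernetwork starts with no used weights ($p_0=0$), that every task uses exactly the weights allocated to it, and that the ceiling is ignored. Then the number of free weights after task $t$ obeys
\begin{align*}
p-p_t=(1-\alpha)(p-p_{t-1})=(1-\alpha)^t\,p,
\end{align*}
so that task $t$ receives $\alpha(1-\alpha)^{t-1}p$ new weights. For instance, with the hypothetical choice $\alpha=0.1$, the first task receives $0.1p$ weights, the second receives $0.09p$, and after $T=20$ tasks a fraction $(0.9)^{20}\approx 0.12$ of the weights is still free. Under these assumptions the allocation decays geometrically and never exhausts the supernetwork, which is what allows packing a large number of tasks.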




\subsection{Experimental Setup}\label{sec:setting}
We evaluate the performance of our proposed ESPN in terms of accuracy and efficiency metrics on three datasets: SplitCIFAR100, RotatedMNIST, and PermutedMNIST. We compare ESPN to numerous baselines, which include training each task individually, multitask learning (MTL), PackNet~\cite{mallya2018packnet}, CPG~\cite{hung2019compacting}, RMN~\cite{kaushik2021understanding}, and SupSup~\cite{wortsman2020supermasks}. The last four methods are zero-forgetting CL methods. \som{Emphasize SUPSUP is learning full-dimensional parameter mask}
\som{Emphasize no one is efficient}
\som{Empirically verifying our FLOP constraint vs theirs}




\noindent{\textbf{Datasets. }} SplitCIFAR100, RotatedMNIST and PermutedMNIST are popular datasets for continual learning that we also use in our experiments. We follow the same setting as in \cite{wortsman2020supermasks}. For SplitCIFAR100 dataset, we randomly split CIFAR100 \cite{krizhevsky2009learning} into 20 tasks where each task contains $5$ classes, $2500$ training samples, and $500$ test samples. RotatedMNIST is generated by rotating all images in MNIST by the same degree. In our experiments, we generate $36$ tasks with $10,20,\ldots, 360$ degree rotations { and train in a random order}. PermutedMNIST dataset is created by applying a fixed pixel permutation to all images, and we created $36$ tasks with independent random permutations. 

\input{sec/table_oc}

\noindent{\textbf{Models and implementation. }} For SplitCIFAR100, {following \cite{lopez2017gradient} we use a variation of ResNet18 model with fewer channels.} For each task, we use a batch size of 128 and Adam optimizer \cite{kingma2014adam} with hyperparameters $(\beta_1,\beta_2)=(0,0.999)$. As shown in Algorithm~\ref{algo 1}, for each task we apply pretraining, pruning, and finetuning strategies. First, we pretrain the model for $60$ epochs with learning rate $0.01$. Then, we gradually prune the channels and weights within $90$ epochs using the same learning rate. For the finetuning stage, we apply cosine decay \cite{loshchilov2016sgdr} over learning rate, starting from $0.01$, and train for $100$ epochs. Therefore, each task trains for $250$ epochs in total. 
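As a rough sketch of the finetuning schedule, assume the standard single-cycle cosine decay of \cite{loshchilov2016sgdr} with no warm restarts and a final learning rate of zero (the final value and the symbols $\eta_e$, $\eta_0$, $E$ are assumptions introduced here for illustration). The learning rate at finetuning epoch $e\in\{0,\dots,E\}$ with $E=100$ and $\eta_0=0.01$ would then read
\begin{align*}
\eta_e=\frac{\eta_0}{2}\left(1+\cos\left(\frac{\pi e}{E}\right)\right),\qquad \eta_{0}=0.01,\quad \eta_{50}=0.005,\quad \eta_{100}=0.
\end{align*}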

For RotatedMNIST and PermutedMNIST experiments, we use the same setting of the FC1024 model as in \cite{wortsman2020supermasks}. This is a fully connected network with two hidden layers of size $1024$. We train for $10$ epochs with the RMSprop optimizer \cite{graves2013generating}, batch size of $256$, and learning rate $0.001$.  
The numbers of pretraining, pruning, and finetuning epochs are $3$, $4$, and $3$, respectively.




As for comparison baselines, MTL and individual training baselines follow the same configurations (e.g.~architecture, hyperparameters) as ESPN. In SplitCIFAR100 experiments, labels are not shared among different tasks. Thus, each task is assigned a separate classifier head while sharing the same backend supernet as a feature extractor. {Unlike SplitCIFAR100, all the tasks in RotatedMNIST and PermutedMNIST experiments share the same head via weight pruning.} MTL simultaneously trains all 20 tasks while sharing the backend supernet. In individual training, each task trains its own backend. Finally, CPG, RMN, and SupSup are all trained with their own publicly available code but over the same ResNet18 model. To ensure a fair comparison, we do not allow dynamic model expansion/enlargement in CPG. Similarly, for PackNet, instead of using its original setting with a fixed pruning ratio for each task, we run it with our \emph{weight allocation} strategy to make it easier to compare. These baselines are all evaluated in Table~\ref{CIFARtable}, which is discussed below.


\subsection{Investigation of Inference Efficiency}




\noindent{\textbf{SplitCIFAR100.}} Table~\ref{CIFARtable} presents results for our experiments on SplitCIFAR100. We use our FLOP-aware pruning technique to prune the channels and apply weight allocation with parameter $\alpha=0.1$. 
Since the impact of the classifier head is negligible (it contains $<0.1\%$ of the weights and $<0.01\%$ of the FLOPs of the backend), we evaluate the FLOPs of each task within the backend only. To compare the performance of different methods, we use the same random seed to generate the $20$ tasks so that the task sequence is exactly the same across all experiments. We conduct ESPN experiments with $20\%$, $50\%$, and $100\%$ FLOPs constraints, corresponding to ESPN-0.2, ESPN-0.5, and ESPN-1 in Table~\ref{CIFARtable}.


The results of Individual-0.2/-1 show that our FLOP-aware pruning technique can effectively reduce the computation requirements while maintaining the same level of model accuracy.  Moreover, PackNet performing better than both CPG and RMN shows the benefit of our weight allocation technique. We remark that the CPG method performs pruning and fine-tuning multiple times until it finds the optimal pruning ratio, which costs significantly more training time compared to PackNet{/ESPN}.  ESPN-0.2/-0.5 results show that our continual learning algorithm outperforms both baselines in accuracy despite up to 80\% FLOP reduction. \ylm{Our accuracy improvement over PackNet arises from the trainable task-specific BatchNorm weights described under Algorithm \ref{algo 1}.}
\ylm{In practice, for all of the CPG, RMN, PackNet, and SupSup methods, task-specific running means and running variances inside BatchNorm layers are essential to reconstruct the same performance. Therefore, although our method stores additional BatchNorm weights for each task, because we prune BatchNorm layers, less replay memory is needed overall compared to the other methods (except for ESPN-1).}








\noindent\textbf{RotatedMNIST and PermutedMNIST.} Figure~\ref{fig:mnist} presents the RotatedMNIST and PermutedMNIST results. We run experiments on ESPN, PackNet, and individual training and report the average accuracy over $5$ trials. Note that, for fully-connected layers, we prune neurons to reduce FLOPs (rather than channels of CNN). In our experiments, we assign neurons a pruning parameter ${\boldsymbol{\Gamma}}$. We use our FLOP-aware pruning technique to prune neurons based on ${\boldsymbol{\Gamma}}$ and use weight allocation with parameter $\alpha=0.05$. \ylm{${\boldsymbol{\Gamma}}$ is released after pruning.} Similar to channel-sharing in Figure \ref{overview fig}, different tasks are allowed to share the same neuron. Unlike SplitCIFAR100 experiments, all the tasks in both MNIST experiments share the same classifier head because they use the same 10 classes, and the FLOPs evaluation includes this head. Therefore, \ylm{as in other zero-forgetting methods,} only a binary mask is needed to restore performance.

\input{sec/fig/mnist_oc}

Results of RotatedMNIST and PermutedMNIST experiments are shown in Figure~\ref{fig:mnist}. Blue curves show the results of ESPN-0.2. For fair comparison, PackNet (Orange curves) uses the same weight allocation parameter $\alpha=0.05$. The Green and Red dashed lines show the task-averaged accuracy of the Individual-0.2/-1 baselines that train each task with separate models under $20\%/100\%$ FLOPs constraints. In both experiments, the gap between the Individual-0.2 (Green) and Individual-1 (Red) curves is negligible, which again shows the benefit of our FLOP-aware pruning technique. While PackNet performs well for the first few tasks, it degrades gradually as more tasks arrive. Thus, when there are more than a few tasks, we see that the ESPN algorithm works better than PackNet despite enforcing inference efficiency and despite using the same weight allocation method. Figure \ref{fig:rotated} shows that ESPN-0.2 on RotatedMNIST exhibits mild accuracy degradation over $36$ tasks. In Figure \ref{fig:permuted}, by contrast, test accuracy decreases more noticeably as tasks are added. A plausible explanation is that, because each PermutedMNIST task corresponds to a random permutation, the tasks are totally unrelated. Thus, knowledge gained from earlier tasks is not useful for training new tasks and CRL does not really help in this case. In contrast, since tasks in RotatedMNIST are semantically related, ESPN and PackNet both achieve higher accuracy thanks to representation reuse across tasks. Specifically, the significant accuracy gap between the Blue curves in Figures \ref{fig:rotated} and \ref{fig:permuted} (especially for larger Task IDs) demonstrates the clear benefit of CRL.


\section{Efficient Sparse PackNet (ESPN) Algorithm}\label{appsec: espnalgo}
In this section, we introduce more implementation and evaluation details of our ESPN algorithm. Denote $[p]=\{1,2,\dots,p\}$. We use the phrases \emph{mask} and \emph{sub-network} interchangeably because we obtain the task sub-network by masking weights of the supernet. This sub-network is the nonzero support of the task, that is, the task-specific model is obtained by setting other weights to zero. We assume a sequence of tasks $\{\mathcal{T}_t$, $1\leq t\leq T\}$ are received during training time, where $t$ is task identifier and $T$ is the number of tasks. Let $\mathcal{S}_t=\{(\vct{x}_{ti},y_{ti})\}_{i=1}^{N_t}$ be labeled training pairs of $\mathcal{T}_t$, which consists of $N_t$ samples. 
Given task sequence and a single model, our goal is to find optimal sparse sub-networks that satisfy both FLOPs and sparsity restrictions without performance reduction and knowledge forgetting. FLOPs constraint is important for efficient inference whereas sparsity constraint is important for adding all tasks into the network even for large number of tasks $T$. {To this end, we use joint channel and weight pruning strategy.} Let $\ell$ be a loss function, $f$ be a hypothesis (prediction function) and ${\boldsymbol{\theta}}\in\mathbb{R}^p$ denote the weights of $f$. We focus on task $\mathcal{T}_t$, assuming $\mathcal{T}_1,...,\mathcal{T}_{t-1}$ are already trained. Let $\vct{m}_t\subset[p]$ be the nonzero support of task $t$ and $\vct{m}^{\text{all}}_t=\cup_{\tau=1}^t \vct{m}_\tau$ be the combined support until task $t$. Initially $\vct{m}^{\text{all}}_0=\emptyset$. Let ${\boldsymbol{\theta}}_t\in\mathbb{R}^p$ be the model weights at time $t$. Note that all the trained weights of the tasks lie on the sub-network $\vct{m}^{\text{all}}_t$. We use the notation ${\boldsymbol{\theta}}\odot\vct{m}$ to set the weights of ${\boldsymbol{\theta}}$ outside of the mask $\vct{m}$ to zero. 
Define the loss ${\cal{L}}_{\mathcal{S}_t}({\boldsymbol{\theta}})=\frac{1}{N_t}\sum_{i=1}^{N_t}\ell(f(\vct{x}_{ti};{\boldsymbol{\theta}}),y_{ti})$. The procedure for learning task $\mathcal{T}_t$ is formulated as the following optimization task that updates {supernet} weight/mask pair and {returns $({\boldsymbol{\theta}}_t,\vct{m}^{\text{all}}_t:=\vct{m}_t\cup \vct{m}^{\text{all}}_{t-1})$} given $({\boldsymbol{\theta}}_{t-1},\vct{m}^{\text{all}}_{t-1})$:
\begin{align}
    {\boldsymbol{\theta}}_t,\vct{m}_t=\arg&\min_{{\boldsymbol{\theta}},\vct{m}}~~~{\cal{L}}_{\mathcal{S}_t}({\boldsymbol{\theta}}\odot\vct{m})\tag{ESPN-OPT}\label{cl_prob}\\
    \text{s.t.} 
    &~~~\text{FLOP}(\vct{m})\leq \overline{\text{FLOP}}_t,\nonumber\\
    &~~~\vct{m}_{\text{new}}=\vct{m}\setminus\vct{m}^{\text{all}}_{t-1},\nonumber\\
    &~~~\text{NNZ}(\vct{m}_{\text{new}})\leq \overline{\text{NNZ}}_t,\nonumber\\
    &~~~{{\boldsymbol{\theta}}\odot\vct{m}^{\text{all}}_{t-1}}={\boldsymbol{\theta}}_{t-1}\odot\vct{m}^{\text{all}}_{t-1}.\nonumber
\end{align}
Here $\vct{m}_t$ is the channel-constrained mask that corresponds to the sub-network of $\mathcal{T}_t$, ${\boldsymbol{\theta}}_t\odot\vct{m}_t$ are the weights we use for task $\mathcal{T}_t$, and the prediction function is $f(\cdot;{\boldsymbol{\theta}}_t\odot\vct{m}_t)$. {The updated mask until task $t$ is obtained by $\vct{m}^{\text{all}}_t=\vct{m}_t\cup \vct{m}^{\text{all}}_{t-1}$.} $\text{FLOP}(\cdot)$ and $\text{NNZ}(\cdot)$ return the FLOPs and the number of nonzeros of a given mask $\vct{m}$. $\overline{\text{FLOP}}_t$ and $\overline{\text{NNZ}}_t$ are the FLOPs and nonzero constraints of task $t$. Observe that we only enforce the NNZ constraint on the new weights $\vct{m}_{\text{new}}$ whereas the FLOPs constraint applies to the whole sub-network. {The last equation in \eqref{cl_prob} highlights that the weights of earlier tasks on $\vct{m}^{\text{all}}_{t-1}$ are kept frozen}. 
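To illustrate the constraints in \eqref{cl_prob}, consider a hypothetical toy instance (the numbers are for illustration only and are not taken from our experiments) with $p=8$ weights, $\vct{m}^{\text{all}}_{t-1}=\{1,2,3,4\}$, and $\overline{\text{NNZ}}_t=2$. A candidate mask $\vct{m}=\{2,3,5,6\}$ reuses the frozen weights $\{2,3\}$ and yields
\begin{align*}
\vct{m}_{\text{new}}=\vct{m}\setminus\vct{m}^{\text{all}}_{t-1}=\{5,6\},\qquad \text{NNZ}(\vct{m}_{\text{new}})=2\leq\overline{\text{NNZ}}_t,\qquad \vct{m}\cup\vct{m}^{\text{all}}_{t-1}=\{1,\dots,6\}.
\end{align*}
In this instance only the weights indexed by $\{5,6\}$ are trained, the weights on $\{1,2,3,4\}$ stay frozen, and the FLOPs constraint is checked on the whole mask $\vct{m}$.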
\som{Discuss algo in more detail} In practice, FLOPs \& NNZ constraints in \eqref{cl_prob} lead to a combinatorial problem. We propose Algorithm \ref{algo 1} to (approximately) solve this problem efficiently which learns the new sub-network in three phases: pre-training (over all free weights), gradual pruning (to satisfy constraints and obtain $\vct{m}_{\text{new},t}$), and fine-tuning (to refine the weights on $\vct{m}_{\text{new},t}$). While not shown in Algorithm \ref{algo 1}, we also introduce the following innovations.

\noindent$\bullet$ \textbf{Trainable task-specific BatchNorm.} 
We train separate BatchNorm layers for each task. This has multiple synergistic benefits. First, our algorithm trains faster and generalizes better than PackNet which does not train BatchNorm weights (see Table \ref{CIFARtable}). Specifically, training BatchNorm weights allows ESPN to re-purpose the (frozen) weights of the earlier tasks with negligible memory cost. BatchNorm weights also guide our channel pruning scheme described next.

\noindent$\bullet$ \textbf{FLOP-aware pruning.} Many of the prior works on channel pruning \cite{liu2017learning,zhuang2020neuron} focus on reducing the number of channels rather than the computation/FLOPs cost of the channels, which varies across layers. As \red{shown in Fig.~\ref{fig:CIFAR10_ratio}, these methods prune numerous channels to satisfy a FLOPs constraint $\gamma$}. This results in unsatisfactory performance, especially under aggressive FLOPs constraints \red{(shown in Fig.~\ref{fig:CIFAR10_acc})}. However, in the CL setup, in order to train a single model with many tasks, a significantly larger supernet with high capacity is needed compared to a single task requirement, and we aim to find a sub-network with very few FLOPs without compromising performance. To fit our specific needs for channel pruning, in this paper we present an innovative channel pruning algorithm called {FLOP-aware channel pruning} that preserves the performance up to 80\% FLOPs reduction in our SplitCIFAR100 experiments (Table \ref{CIFARtable}). Following \cite{liu2017learning}, we consider BatchNorm weights as a trainable saliency score for convolutional channels and prune all channels with scores lower than a certain threshold by setting them to zero. Let ${\boldsymbol{\Gamma}}$ be the BatchNorm vectors. Given dataset $\{(\vct{x}_i,y_i)\}_{i=1}^N$, in practice, rather than solving the constrained problem \eqref{cl_prob}, Algorithm~\ref{algo 1} minimizes a regularized objective ${\cal{L}}_{\mathcal{S}_t}({\boldsymbol{\theta}}\odot\vct{m})+\mathcal{R}({\boldsymbol{\Gamma}},\vct{m})$. Here $\mathcal{R}$ is the regularization term in \eqref{app eq:reg} that promotes channel sparsity.




For an $L$ layer network, use ${\boldsymbol{\Gamma}}_l$ to denote the $l^{\text{th}}$ {BatchNorm weights} for $l\in[L]$. Intuitively, we wish to use $\ell_1$-regularization on ${\boldsymbol{\Gamma}}$. However, since layers of the network show variation, layerwise regularization parameters $(\lambda_l)_{l=1}^L$ are needed where $L$ is number of layers. Instead of designing $\lambda_l$ by trial-and-error -- which is time-consuming and expert-dependent -- we introduce a method that chooses $\lambda_l$ automatically, fine-tunes $\lambda_l$ during gradual pruning and adapts to the global FLOPs constraint. 
 Specifically, $\lambda_l$ is chosen based on the {FLOPs load of a channel} determined by its input feature dimensions and operations. Here an implicit goal is pruning as few channels as possible while achieving maximum FLOPs reduction. To achieve these goals, we use the following FLOP-weighted sparsity regularization (as also stated in the main body)
 
 \begin{align}
	\label{app eq:reg}\mathcal{R}({\boldsymbol{\Gamma}},\vct{m})&=\sum_{l\in [L]}\lambda_l\|{\boldsymbol{\Gamma}}_l\|_1,~~\lambda_l=g(\text{FLOP}_l(\vct{m}^{(l)})).

\end{align}
Here $\vct{m}^{(l)}$ denotes the restriction of the sub-network $\vct{m}$ to the $l^{\text{th}}$ layer, $\text{FLOP}_l(\cdot)$ is the FLOPs load for a channel in the $l^{\text{th}}$ layer of the subnet, and $g(\cdot)$ is a monotonically increasing function. We use the $\ell_1$-penalty to drive unimportant elements to zero and prune the channels with the smallest weights over all layers. Since $g(\cdot)$ is increasing, channels costing more FLOPs are assigned larger $\lambda_l$ and are pushed towards zero, and thus they are easier to prune. Additionally, since $\text{FLOP}_l(\cdot)$ is based on the subnet $\vct{m}^{(l)}$, $\lambda_l$ is automatically tuned during gradual pruning (Line~\ref{algo:pruning} in Algorithm \ref{algo 1}). In our experiments, we use $g(x)=C\sqrt{x}$ for a proper scaling choice $C>0$. \red{Here $C$ can be seen as a normalization term, and in detail we have
\begin{align*}
    \lambda_l=\frac{\sqrt{\text{FLOP}_l(\vct{m}^{(l)})}}{\sum_{i=1}^L\sqrt{\text{FLOP}_i(\vct{m}^{(i)})}}.
\end{align*}}
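For intuition, consider a hypothetical three-layer network (illustrative numbers only, not taken from our experiments) whose per-channel FLOPs loads are $\text{FLOP}_1=100$, $\text{FLOP}_2=400$, and $\text{FLOP}_3=900$. The square roots are $10$, $20$, and $30$, which sum to $60$, so that
\begin{align*}
\lambda_1=\tfrac{10}{60}=\tfrac{1}{6},\qquad \lambda_2=\tfrac{20}{60}=\tfrac{1}{3},\qquad \lambda_3=\tfrac{30}{60}=\tfrac{1}{2}.
\end{align*}
A $9\times$ gap in FLOPs load thus translates into only a $3\times$ gap in regularization strength, and the weights $\lambda_l$ sum to one by construction.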





\noindent$\bullet$ \textbf{Weight allocation.} Since we do not modify the supernet architecture, without care, the supernet might run out of free weights if there is a huge number of tasks. While the original PackNet paper \cite{mallya2018packnet} also uses weight pruning, since it considers relatively few tasks, it does not develop an algorithmic strategy for allocating the free weights to new tasks. In our experiments, we introduce a simple weight allocation scheme to assign $\overline{\text{NNZ}}_t$ {depending on the number of remaining free weights}. Let $p$ be the total number of weights, $p_t$ be the total number of weights used by tasks $1$ to $t$ and $p_0=0$. We set 
\begin{align}\label{wa-eq}
\overline{\text{NNZ}}_t=\lceil (p-p_{t-1})\cdot\alpha\rceil\quad\text{for some}\quad 0<\alpha<1.
\end{align} Here $\alpha$ is {the \emph{weight-allocation} level} and a new task gets to use an $\alpha$ fraction of all unused weights in the supernet. We emphasize that weight allocation controls the number of new nonzeros allocated to a task. A new task is allowed to use all of the (frozen) nonzeros that are allocated to the previous tasks (as long as the FLOPs constraint is not violated).
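As a back-of-the-envelope illustration, suppose every task uses its full budget and ignore the rounding in \eqref{wa-eq} (both are simplifying assumptions made only for this sketch). Then $p-p_t=(1-\alpha)(p-p_{t-1})$, and unrolling the recursion from $p_0=0$ gives
\begin{align*}
p-p_t\approx(1-\alpha)^t\,p,\qquad \overline{\text{NNZ}}_t\approx\alpha(1-\alpha)^{t-1}\,p.
\end{align*}
For example, $\alpha=0.1$ and $t=20$ leave $0.9^{20}\approx 12\%$ of the weights free, while $\alpha=0.05$ and $t=36$ leave $0.95^{36}\approx 16\%$ free. Under this idealization the per-task budget shrinks geometrically, so later tasks receive fewer new weights but the pool of free weights is never fully exhausted.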





\section{Expanded Related Work}\label{app:related}


Our contributions are most closely related to the continual learning literature. Our theory and algorithms also connect to representation learning and neural network pruning.

\noindent\textbf{Continual learning.}
A number of methods for continual and lifelong learning have been proposed to tackle the problem of catastrophic forgetting, and existing approaches can be broadly categorized into three groups~\cite{delange2021continual}: replay-based~\cite{lopez2017gradient,rebuffi2017icarl,rolnick2018experience,buzzega2021rethinking,aljundi2019gradient}, regularization-based~\cite{kirkpatrick2017overcoming,zenke2017continual,li2017learning}, and parameter isolation methods~\cite{fernando2017pathnet,yoon2017lifelong,mallya2018piggyback,rusu2016progressive}. In our work, we focus on a branch of parameter isolation methods called zero-forgetting CL, such as PackNet~\cite{mallya2018packnet}, CPG~\cite{hung2019compacting}, RMN~\cite{kaushik2021understanding}, and SupSup~\cite{wortsman2020supermasks}, which completely eliminates forgetting by training a sub-network for each task and freezing the trained parameters (SupSup excepted). Moreover, the findings of \cite{frankle2018lottery} show that over 90\% of the parameters of a network can be pruned without performance reduction. This inspires our weight-allocation strategy to adapt PackNet to sparser sub-networks. Unlike PackNet, which prunes the network by keeping the largest absolute weights in each layer and reuses all the frozen weights, CPG and RMN apply a real-valued mask to each fixed entry and prune by keeping the largest values of the mask. SupSup is motivated by \cite{ramanujan2020s,zhou2019deconstructing,malach2020proving}, which show that a sufficiently over-parameterized random network contains a sub-network with roughly the same accuracy as the target network without training. In essence, it aims to find masks only over a random network, and it is adaptable to an unbounded number of tasks. However, it leads to inefficient inference (due to using the full network) and potentially large memory requirements (as one has to store a mask as large as the supernet rather than a subnet).

In CL, in order to load a network with a large number of tasks, a large model is often needed, which naturally leads to inefficiency at inference time without proper safeguards. As far as we are aware, addressing this challenge is an unexplored avenue. In this work, we present Efficient Sparse PackNet (ESPN), which implements zero-forgetting CL and achieves state-of-the-art accuracy with less computational demand.

We also emphasize that there are several interesting works on the theory of continual learning such as \cite{lee2021continual, doan2021theoretical,bennani2020generalisation,mirzadeh2020understanding,yin2020optimization}. These works focus on NTK-based analysis for deep nets, theoretical investigation of orthogonal gradient descent \cite{farajtabar2020orthogonal}, and task-similarity. However, to the best of our knowledge, ours is the first work on the representation learning ability and the associated data-efficiency.



\noindent\textbf{Representation learning theory.} The rise of deep learning motivated a growing interest in theoretical principles behind representation learning. Similar in spirit to this project, \cite{maurer2016benefit} provides generalization bounds for representation-based transfer learning in terms of the Rademacher complexities associated with the source and target tasks. Some of the earliest works towards this goal include \cite{baxter2000model}~and linear settings of \cite{lounici2011oracle,pontil2013excess,wang2016distributed,cavallanti2010linear}. More recent works \cite{hanneke2020no,lu2021power,kong2020meta,wu2020understanding,garg2020functional,gulluk2021sample,du2020few,tripuraneni2020theory,qin2022non,tripuraneni2020provable,sun2021towards,maurer2016benefit,arora2019theoretical} consider variations beyond supervised learning, concrete settings or established more refined upper/lower performance bounds. There is also a long line of works related to model agnostic meta-learning \cite{finn2017model,denevi2019online,balcan2019provable,khodak2019adaptive}. Unlike these works, we consider the CL setting and show how the representation learned by earlier tasks provably helps learning the new tasks with fewer samples and better accuracy. 




\noindent\textbf{Neural network pruning.} Our work is naturally related to neural network pruning methods and compression techniques as we embed tasks into sub-networks. Large model sizes in deep learning have led to a substantial interest in model pruning/quantization \cite{han2015deep,hassibi1993second,lecun1990optimal}. DNN pruning has a diverse literature with various architectural, algorithmic, and hardware considerations \cite{sze2017efficient,han2015learning}.  Here, we mention the ones related to our work. \cite{frankle2018lottery} empirically shows that a large DNN contains a small subset of favorable weights (for pruning), which can achieve similar performance to the original network when trained with the same initialization. \cite{zhou2019deconstructing,malach2020proving,pensia2020optimal} demonstrate that there are subsets with good test performance even without any training and provide theoretical guarantees. Relatedly, \cite{chang2021provable} establishes the theoretical benefits of training large over-parameterized networks to improve downstream pruning performance.

Although weight pruning is a proven way to reduce model parameters while maintaining performance, in practice it does not lead to compute efficiency except on dedicated hardware~\cite{han2016eie}. Unlike weight pruning, structured/channel pruning prunes the model at the channel level, which results in a slim sub-network carrying far fewer FLOPs than the original dense model~\cite{liu2017learning, zhuang2020neuron, wen2016learning, ye2018rethinking}. For example,~\cite{wen2016learning,zhou2016less,alvarez2016learning,lebedev2016fast,he2017channel} prune models by adding a sparse regularization over model weights whereas \cite{liu2017learning, zhuang2020neuron} only add regularization over channel factors and prune channels with lower scaling factors. However, these prior works do not focus on the scenario where almost all of the FLOPs are pruned, for example with only 1\% of the original FLOPs remaining. To achieve this goal, we present an innovative channel pruning method based on FLOP-aware penalization. Our technique is inspired by \cite{liu2017learning, zhuang2020neuron} (it uses sparsity regularization over BatchNorm weights only); however, it outperforms both methods as demonstrated in Appendix~\ref{pruning sec}.






\section{Empirical and Theoretical Insights for Continual Representation Learning}\label{sec:crl}
In this section, we discuss continual learning from the representation learning perspective. {We first present our experimental insights which show that (1) features learned from previous tasks help reduce the sample complexity of new tasks and (2) the order of task sequence (in terms of diversity and sample size) is critical for the success of CRL. In Section~\ref{sec:crl_theory}, we present our theoretical framework and a rigorous analysis in support of our experimental findings.}


\subsection{Empirical Investigation of CRL}\label{sec:crl_exp}
We further elaborate on Figure \ref{figure 2 label} and discuss the role of sample size (for both past and new tasks) and task diversity.

\noindent\textbf{Investigating data efficiency (Fig.~\ref{fig:LS}).} A good way to assess the benefit of CRL is to construct settings where new tasks have fewer samples. Consider SplitCIFAR100 for which $100$ classes are randomly partitioned into $20$ tasks. We partition the tasks into two sets: a continual learning set ${\cal{D}}_{cl}=\{\mathcal{T}_1,\dots,\mathcal{T}_{15}\}$ and a test set ${\cal{D}}_t=\{\mathcal{T}_{16},\dots,\mathcal{T}_{20}\}$. The test set is used to assess data efficiency and therefore (intentionally) contains only 10\% of the original sample size (250 samples per task instead of 2500). We first train the network sequentially using  ${\cal{D}}_{cl}$  via ESPN/PackNet and create checkpoints of the supernetwork at different task IDs: At time $t$, we get a supernet \ylm{$s_t$ that is} trained with $\mathcal{T}_1,\dots,\mathcal{T}_t$, for $t\leq15$. {$t=0$} stands for the initial supernet without any training. Then we assess the representation quality of each supernet checkpoint by individually training the tasks in ${\cal{D}}_t$ on it. Fig.~\ref{fig:LS} displays the test accuracy of tasks in ${\cal{D}}_t$ where we used the SplitCIFAR100 setting detailed in Sec.~\ref{sec:setting}. Figure~\ref{fig:LS} shows that the ESPN-0.2,  ESPN-1, and PackNet methods all benefit from features trained by earlier tasks since the accuracy is above $75\%$  when we use supernets trained sequentially with multiple tasks. In contrast, individual learning trains separate models for each task where no knowledge is transferred; the accuracy is close to  $68\%$. Note that the performance of ESPN gradually increases with the growing number of continual tasks. This reveals its ability to successfully transfer knowledge from previously learned tasks and reduce sample complexity.
\input{sec/fig/diversity}
\noindent\textbf{Importance of task order and   diversity.} To study how task order and diversity benefit CL, similar to \cite{mallya2018packnet,hung2019compacting}, we use $6$ image classification tasks, where ImageNet-1k~\cite{krizhevsky2012imagenet} is the first task, followed by CUBS~\cite{wah2011caltech}, Stanford Cars~\cite{krause20133d}, Flowers~\cite{nilsback2008automated}, WikiArt~\cite{saleh2015large} and Sketch~\cite{eitz2012humans}. Intuitively, ImageNet should be trained first because of its higher diversity. 
Figure~\ref{fig:diversity} shows the accuracy improvement on the $5$ tasks that follow ImageNet pretraining compared to individual training. 
The results are displayed in Fig.~\ref{fig:diversity} where Green bars are CL with ImageNet as the first task, Orange is CL without ImageNet, and Blue is Individual training. In essence, this shows the importance of initial representation diversity in CL since results with the ImageNet pretraining (Green) are consistently and strictly better than no pretraining (Orange) and Individual (Blue).
We note that related findings for zero-forgetting  CL have been reported in \cite{hung2019compacting,mallya2018piggyback,mallya2018packnet,tu2020extending}, which further motivates our theory in Sec.~\ref{sec:crl_theory}. Unlike these works, our experiment aims to isolate the CRL benefit of ImageNet by also training the other 5 tasks continually without ImageNet.  We defer experimental details to the supplementary material.



\noindent\textbf{Importance of sample size.}
Finally, we show that the sample size is also critical for CRL because one can build higher-quality (less noisy) features with more data. To this end, we devise another experiment based on SplitCIFAR100 dataset. Instead of using original tasks each with 5 classes and 2,500 samples, we train the first task (task ID 1 in Fig.~\ref{fig:Sample}) with 2,500 samples, then decrease the sample size for all the following tasks using the rule $2,500\times (1/20)^{\text{(ID-1)}/19}$ until the last task (task ID 20 in Fig.~\ref{fig:Sample}) has only 125 training samples. Figure~\ref{fig:Sample} presents the results, where solid curves are obtained by training Task ID 1 to 20 with decreasing sample size, dashed curves are for training Task ID 20 to 1 with increasing sample size, and dotted curves are for individual training where task order does not matter. The accuracy curves are smoothed with a moving average and displayed in the decreasing order from Task ID 1 to 20. The results support our intuition that training large sample tasks first (decreasing order) performs better, as larger tasks build high quality representations that benefit generalization for future small tasks with less data. More strikingly, the dotted Individual line falls strictly between solid and dashed curves. This means that increasing order actually \emph{hurts accuracy} whereas decreasing order \emph{helps accuracy} compared to training from scratch (i.e.~no representation). Specifically, decreasing helps (solid$>$dotted) on the right side of the figure where tasks are small (thanks to good initial representations) whereas increasing hurts on the left side where tasks are large. The latter is likely due to the fact that, adding a large task requires a larger/better subnetwork to achieve high accuracy, however, since we train small tasks first, supernetwork runs out of sufficient free weights for a large subnetwork.
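To make the sample-size rule concrete, write $n_{\text{ID}}=2{,}500\times(1/20)^{(\text{ID}-1)/19}$ (the symbol $n_{\text{ID}}$ is introduced here only for illustration). Consecutive tasks differ by the constant factor $(1/20)^{1/19}\approx 0.854$, and
\begin{align*}
n_1=2{,}500,\qquad n_{10}=2{,}500\times 20^{-9/19}\approx 605,\qquad n_{20}=2{,}500/20=125,
\end{align*}
where non-integer values have to be rounded (the exact rounding convention is not specified here).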

\subsection{Theoretical Analysis and Performance Bounds for Continual Representation Learning}\label{sec:crl_theory}

In this section, we provide theoretical analysis to explain how CRL provably promotes sample efficiency and benefits from initial tasks with large diversity and sample size.


We use $\ordet{\cdot}$ to denote equality up to a factor involving at most logarithmic terms. Following our experiments as well as \cite{maurer2016benefit}, a realistic model for deep representation learning is the compositional hypothesis $f=h\circ \phi$ where $h\in{\mtx{H}}$ is the classifier head and $\phi\in{\boldsymbol{\Phi}}$ is the shared backend feature extractor. In practice, $\phi$ has many more parameters than $h$. To model continual learning, let us assume that we have already trained a frozen feature extractor $\phi_{\text{frz}}\in{\boldsymbol{\Phi}}_{\text{frz}}$\ylm{ on earlier tasks {from which $\phi$ can be learned faster}}. {Here ${\mtx{H}},{\boldsymbol{\Phi}},{\boldsymbol{\Phi}}_{\text{frz}}$ are the hypothesis sets to learn from.} Suppose we are now given a set of $T$ new tasks represented by independent datasets $\mathcal{S}_t=\{(\vct{x}_{ti},y_{ti})\}_{i=1}^N{\subset \mathcal{X}\times \mathcal{Y}}$, each drawn i.i.d.~from different distributions ${\cal{D}}_t$ for $1\leq t\leq T$. \ylm{$\mathcal{X},~\mathcal{Y}$ are the sets of feasible input features and labels respectively.} Our goal is to build the hypotheses $(f_t)_{t=1}^{T}:\mathcal{X}\rightarrow\mathbb{R}$ for these new tasks with small sample size $N$ while leveraging $\phi_{\text{frz}}\in{\boldsymbol{\Phi}}_{\text{frz}}$. 

 \textbf{CRL setting.} In a realistic CL setting the new tasks are allowed to learn new features. We will capture this with an \emph{incremental} feature extractor {$\phi_{\text{new}}\in {\boldsymbol{\Phi}}_{\text{new}}$, and represent the hypothesis of each task via the composition $f_t=h_t\circ \phi$ where $\phi=\phi_{\text{new}}+\phi_{\text{frz}}$. For PackNet/ESPN, ${\boldsymbol{\Phi}}_{\text{new}}$ corresponds to the free/trainable weights allocated to the new task, $\phi_{\text{frz}}$ corresponds to the trained weights of the earlier tasks and $\phi$ corresponds to the eventual task subnetwork and its weights. We will evaluate the quality of $\phi$ (which lies in the Minkowski sum ${\boldsymbol{\Phi}}_{\text{new}}+{\boldsymbol{\Phi}}_{\text{frz}}$) with respect to a global representation space ${\boldsymbol{\Phi}}$ which is chosen to be a superset: ${\boldsymbol{\Phi}}_{\text{new}}+{\boldsymbol{\Phi}}_{\text{frz}}\subseteq {\boldsymbol{\Phi}}$. For instance, in PackNet, ${\boldsymbol{\Phi}}_{\text{new}}+{\boldsymbol{\Phi}}_{\text{frz}}$ denotes the sparse sub-networks allocated to the \ylm{new and previous} tasks whereas ${\boldsymbol{\Phi}}$ corresponds to the full supernet.}



This motivates us to pose a CRL problem that builds a continual representation by searching for $\phi_{\text{new}}$ and combining it with $\phi_{\text{frz}}$. Letting ${\vct{h}}=(h_t)_{t=1}^T\in {\mtx{H}}^T$ denote all $T$ task-specific classifier heads, we solve
\begin{align}
\underset{\phi=\phi_{\text{new}}+\phi_{\text{frz}}}{\underset{{\vct{h}}\in{\mtx{H}}^T,\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}}{\arg\min}}&{\widehat{\cal{L}}}({\vct{h}},\phi):=\frac{1}{T}\sum_{t=1}^T {\widehat{\cal{L}}}_{\mathcal{S}_t}(h_t\circ\phi)\nonumber\\
\text{where}\quad &{\widehat{\cal{L}}}_{\mathcal{S}_t}(f):=\frac{1}{N}\sum_{i=1}^N \ell(y_{ti},f(\vct{x}_{ti})). \tag{CRL}\label{crl}
\end{align}
\noindent \textbf{Intuition:} \eqref{crl} aims to learn the task-specific heads ${\vct{h}}$ and the shared incremental representation $\phi_{\text{new}}$. Let $\cc{\cdot}$ be a complexity measure for a function class (e.g.~VC-dimension). Intuitions from the MTL literature would advocate that when the total sample size obeys $N\times T\gtrsim T\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}}$, then \eqref{crl} would return generalizable solutions ${\vct{\hat{h}}},\hat{\phi}_{\text{new}}$. This is desirable as in practice $\phi_{\text{frz}}$ is a much more complex hypothesis obtained by training on many earlier tasks. Thus, from a continual learning perspective, our theoretical goals are:
\begin{enumerate}
    \item The sample size should only depend on the complexity $\cc{{\boldsymbol{\Phi}}_{\text{new}}}$ of the incremental representation rather than the combined complexity that can potentially be much larger ($\cc{{\boldsymbol{\Phi}}_{\text{new}}}+\cc{{\boldsymbol{\Phi}}_{\text{frz}}}\gg \cc{{\boldsymbol{\Phi}}_{\text{new}}}$).
    \item To explain Figure \ref{fig:LS}, we would like to quantify how frozen representation $\phi_{\text{frz}}$ can provably help accuracy. Ideally, thanks to $\phi_{\text{frz}}$, we can discover a near-optimal $\phi$ from the larger hypothesis set ${\boldsymbol{\Phi}}$.
    \item Finally, we emphasize that we add the $T$ new tasks to the network in one round for the sake of cleaner exposition. In the appendix, we provide synergistic theory and a detailed investigation of the scenario where tasks are learned sequentially in a continual fashion and the frozen features $\phi_{\text{frz}}$ evolve as we add more tasks. In a nutshell, this theory explains Figure \ref{fig:Sample} by quantifying the role of sample size in the quality of continual representations.
\end{enumerate}


Before stating our technical results, we need to introduce a few definitions. To quantify the complexities of the search spaces ${\boldsymbol{\Phi}}_{\text{new}},{\mtx{H}}$, we introduce \emph{metric dimension} \cite{mendelson2003few}, which is a generalization of the VC-dimension \cite{vapnik2015uniform}.

\begin{definition}[Metric dimension] \label{def:cov} Let ${\mtx{G}}:\mathcal{Z}\rightarrow\mathcal{Z}'$ be a set of functions. {Let $\bar{C}_{\mathcal{Z}}>0$ be a scalar that is allowed to depend on $\mathcal{Z}$.} Let ${\mtx{G}}_{\varepsilon}$ be a minimal-size $\varepsilon$-cover of ${\mtx{G}}$ such that for any $g\in{\mtx{G}}$ there exists $g'\in {\mtx{G}}_{\varepsilon}$ that ensures $\sup_{\vct{x}\in\mathcal{Z}}\tn{g(\vct{x})-g'(\vct{x})}\leq \varepsilon$. The metric dimension $\cc{{\mtx{G}}}$ is the smallest number that satisfies $\log|{\mtx{G}}_{\varepsilon}|\leq \cc{{\mtx{G}}}\log(\bar{C}_{\mathcal{Z}}/\varepsilon)$ for all $\varepsilon>0$.
\end{definition}
{$\bar{C}_{\mathcal{Z}}$ typically depends only logarithmically on the Euclidean radius of the feature space under mild Lipschitzness conditions; thus, the dependence on $\bar{C}_{\mathcal{Z}}$ will be dropped for cleaner exposition.} In practice, for neural networks or other parametric hypotheses, the metric dimension is bounded by the number of trainable weights up to logarithmic factors \cite{barron2018approximation}.

Metric dimension will help us characterize the sample complexity. However, we also would like to understand when ${\boldsymbol{\Phi}}_{\text{frz}}$ can help. To this end, we introduce definitions that capture the population loss (infinite data limit) of new tasks and the \emph{feature compatibility} between the new tasks and $\phi_{\text{frz}}$ of old tasks. These definitions help decouple the finite sample size $N$ and the distribution of the new tasks.
\begin{definition}[Distributional quantities]\label{def pop}Define the population (infinite-sample) risk as ${\cal{L}}({\vct{h}},\phi)=\operatorname{\mathbb{E}}[{\widehat{\cal{L}}}({\vct{h}},\phi)]$. Define the optimal risk over representation ${\boldsymbol{\Phi}}$ as ${\cal{L}}^{\st}=\min_{{\vct{h}}\in{\mtx{H}}^T,\phi\in{\boldsymbol{\Phi}}}{\cal{L}}({\vct{h}},\phi)$. Note that the optimal risk can always choose the best representations within ${\boldsymbol{\Phi}}_{\text{new}}$ and ${\boldsymbol{\Phi}}_{\text{frz}}$ since ${\boldsymbol{\Phi}}_{\text{new}}+{\boldsymbol{\Phi}}_{\text{frz}}\subseteq {\boldsymbol{\Phi}}$. Finally, define the optimal population risk using frozen $\phi_{\text{frz}}$ to be ${\cal{L}}^{\st}_{{\phi}_{\text{frz}}}=\min_{{\vct{h}}\in{\mtx{H}}^T,\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}}}{\cal{L}}({\vct{h}},\phi)$ s.t.~$\phi=\phi_{\text{new}}+\phi_{\text{frz}}$.
\end{definition}
Following this, the \emph{representation mismatch} introduced below assesses the suboptimality of $\phi_{\text{frz}}$ for the new task distributions compared to the optimal hypothesis within ${\boldsymbol{\Phi}}$.
\begin{definition}[New \& old tasks mismatch]\label{def MM}
{The representation mismatch between the frozen features $\phi_{\text{frz}}$ and the new tasks} is defined as \vspace{-8pt}
\begin{align*}
    \text{MM}_{\text{frz}}={\cal{L}}^{\st}_{\phi_{\text{frz}}}-{\cal{L}}^{\st}.
\end{align*}
\end{definition}
By construction, $\text{MM}_{\text{frz}}$ is guaranteed to be non-negative. Additionally, $\text{MM}_{\text{frz}}=0$ if we choose the global space to be ${\boldsymbol{\Phi}}={\boldsymbol{\Phi}}_{\text{new}}+{\boldsymbol{\Phi}}_{\text{frz}}$ and $\phi_{\text{frz}}$ to be the optimal hypothesis within ${\boldsymbol{\Phi}}_{\text{frz}}$. With these definitions, we have the following generalization bound for problem \eqref{crl}. The proof is deferred to Appendix~\ref{app B}.
\begin{theorem}\label{cl thm} Let ${\vct{h}},{\vct{\hat{h}}}$ denote the set of classifiers $(h_t)_{t=1}^T,({\hat{h}}_t)_{t=1}^T$ respectively and $({\vct{\hat{h}}},\hat{\phi}=\hat{\phi}_{\text{new}}+\phi_{\text{frz}})$ be the solution of \eqref{crl}. Suppose that the loss function $\ell(y,\hat{y})$ takes values on $[0,1]$ and is $\Gamma$-Lipschitz w.r.t.~$\hat{y}$. {Suppose that input set $\mathcal{X}$ is bounded and all $\phi_{\text{new}}\in{\boldsymbol{\Phi}}_{\text{new}},~h\in{\mtx{H}}$, \ylm{${\boldsymbol{\Phi}}_{\text{new}}\in\{{\boldsymbol{\Phi}}_{\text{new}}^i,1\leq i\leq k\}$,} and $\phi_{\text{frz}}$ have Lipschitz constants upper bounded with respect to Euclidean distance}. With probability at least $1-2e^{-\tau}$, the task-averaged population risk of the solution $({\vct{\hat{h}}},\hat{\phi})$ obeys 
{
\begin{align}
{\cal{L}}({\vct{\hat{h}}},\hat{\phi})&\leq {\cal{L}}^{\st}_{\phi_{\text{frz}}}+ \sqrt{\frac{\ordet{T\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}}+\tau}}{T N}}\nonumber\\
&\leq {\cal{L}}^{\st}+{\text{MM}_{\text{frz}}}+{\sqrt{\frac{\ordet{T\cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}}+\tau}}{T N}}}.\nonumber
\end{align}}
\end{theorem} 
In words, this theorem shows that as soon as the total sample complexity obeys $T N\gtrsim T \cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}}$, we achieve small excess statistical risk and avoid the sample cost of learning $\phi_{\text{frz}}$ from scratch. {Importantly, the sample cost $\cc{{\boldsymbol{\Phi}}_{\text{new}}}$ of learning the incremental representation is shared between the tasks since per-task sample size $N$ only needs to grow with $\cc{{\boldsymbol{\Phi}}_{\text{new}}}/T$.} 
Reusing ${\boldsymbol{\Phi}}_{\text{frz}}$ comes at the cost of a prediction bias $\text{MM}_{\text{frz}}$ arising from the feature mismatch. Also, with access to a larger sample size (e.g.~$T N\gtrsim T\cc{{\mtx{H}}}+ \cc{{\boldsymbol{\Phi}}}$), new tasks can learn a near-optimal $\phi^\star\in{\boldsymbol{\Phi}}$ from scratch\footnote{In this statement, we ignore the continual nature of the problem and allow $\phi_{\text{frz}}$ to be overridden for the new tasks if necessary.}. Thus, the benefit of \eqref{crl} on data-efficiency is most visible when the new tasks have few samples, which is exactly the setting in Figure \ref{fig:LS}.
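
To make this comparison concrete, dividing both sample-size conditions by $T$ places the per-task requirements side by side. The numerical values used afterwards are hypothetical and chosen only to illustrate the scaling; they are not taken from our experiments.
\begin{align*}
\text{reusing $\phi_{\text{frz}}$:}\quad & N\gtrsim \cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}_{\text{new}}}/T,\\
\text{learning from scratch:}\quad & N\gtrsim \cc{{\mtx{H}}}+\cc{{\boldsymbol{\Phi}}}/T.
\end{align*}
For instance, with the illustrative values $T=10$, $\cc{{\mtx{H}}}=10$, $\cc{{\boldsymbol{\Phi}}_{\text{new}}}=10^3$ and $\cc{{\boldsymbol{\Phi}}}=10^5$, the first condition reads $N\gtrsim 110$ whereas the second reads $N\gtrsim 10{,}010$ (both up to constants and logarithmic factors). The price paid for the smaller sample requirement is the additive mismatch term in the risk bound.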


\noindent$\bullet$ \textbf{${\boldsymbol{\Phi}}_{\text{new}}$ \& representation diversity.} Imagine the scenario where $\phi_{\text{frz}}$ is already very rich and approximately coincides with the optimal hypothesis within the global space ${\boldsymbol{\Phi}}$. This is precisely the ImageNet setting of Figure \ref{fig:diversity} where even fine-tuning $\phi_{\text{frz}}$ will achieve respectable results. Mathematically, this corresponds to the scenario where ${\boldsymbol{\Phi}}_{\text{new}}$ is the empty set yet the mismatch is $\text{MM}_{\text{frz}}\approx 0$. In this case, our theorem reduces to the standard few-shot learning risk where the only cost is learning ${\mtx{H}}$, i.e.~${\cal{L}}({\vct{\hat{h}}},\hat{\phi})\leq {\cal{L}}^{\st}+ \sqrt{{\ordet{\cc{{\mtx{H}}}}}/{N}}$.


\noindent$\bullet$ \textbf{$\text{MM}_{\text{frz}}$ \& initial sample size.} Note that $\phi_{\text{frz}}$ is built using previous tasks, which have finite samples. The sequential CL analysis we develop in the appendix decomposes the mismatch as $\text{MM}_{\text{frz}}\lesssim\text{MM}_{\text{frz}}^\star+\ordet{\cc{{\boldsymbol{\Phi}}_{\text{frz}}}/N_{\text{prev}}}$, where $\text{MM}_{\text{frz}}^\star$ is the mismatch if previous tasks had $N_{\text{prev}}=\infty$ samples and $\ordet{\cc{{\boldsymbol{\Phi}}_{\text{frz}}}/N_{\text{prev}}}$ is the excess mismatch due to finite samples, shedding light on Figure \ref{fig:Sample}.


Our analysis is related to the literature on representation learning theory \cite{maurer2016benefit,kong2020meta,wu2020understanding,du2020few,gulluk2021sample,tripuraneni2020theory,arora2019theoretical}. Unlike these works, we consider the CL setting and show how the representation learned by earlier tasks provably helps learning the new tasks with fewer samples and how initial representation diversity and sample size benefit CRL. Importantly, we also specialize our general results to neural networks (see Theorem \ref{cl thm3} and Appendix \ref{app:application}) to obtain tight sample complexity bounds (in the degrees of freedom).

\noindent$\bullet$ \textbf{Adding tasks sequentially.} Theorem \ref{seq thm} in the appendix also provides guarantees for the practical setting where $T$ tasks are added sequentially to the network. Informally, this theorem provides the following cumulative generalization bound on $T$ tasks (see Line \eqref{gen bound 3})
\[
\sum_{t=1}^T \text{``excess risk of task $t$''} \leq \sum_{t=1}^T \text{``mismatch of task $t$''}+T^2 \times \text{``statistical error per task''}.
\]
Here, ``mismatch of task $t$'' is evaluated with respect to the earlier $t-1$ tasks assuming those $t-1$ tasks have infinite samples. ``Statistical error per task'' is the generalization risk arising from each task having finite samples. In our bound, the growth of statistical error is quadratic in $T$ because when the new task $t$ uses imperfect representations learned by the earlier $t-1$ tasks (which have finite samples), the statistical error may aggregate. The overall proof idea is to decouple the population risk from the empirical risk by defining the ``mismatch of task $t$'' in terms of the landscape of the population loss (see Definition \ref{def pop seq}).











\section{Preliminaries}

\textbf{Notation.} Let $\{\mathcal{T}_\tau, \tau\in \mathbb N^+\}$ be the task identifiers and ${\cal{D}}_\tau$ be the sample distribution of task $\tau$. The pairs $\{(\vct{x}_{\tau,i},y_{\tau,i})\}_{i=1}^{n_\tau}$ are input-output examples sampled from ${\cal{D}}_\tau$ to generate $\mathcal{T}_\tau$, where $n_\tau$ is the sample size.
In the following discussion, we consider $\tau\leq T$ where $T\in\mathbb N^+$.


\subsection{}

\begin{align*}
    {\boldsymbol{\theta}}^M&:=\arg\min_{{\boldsymbol{\theta}}}\sum_{\tau=1}^T\sum_{i=1}^{n_\tau}\ell(f(\vct{x}_{\tau,i};{\boldsymbol{\theta}}),y_{\tau,i})\\
    {\boldsymbol{\theta}}^I_\tau&:=\arg\min_{{\boldsymbol{\theta}}}\sum_{i=1}^{n_\tau}\ell(f(\vct{x}_{\tau,i};{\boldsymbol{\theta}}),y_{\tau,i})\\
    {\boldsymbol{\theta}}_\tau^{CL}&:=\arg\min_{{\boldsymbol{\theta}},f_\tau}\sum_{i=1}^{n_\tau}\ell(f_\tau(\vct{x}_{\tau,i};{\boldsymbol{\theta}}_{\tau-1}^{CL},{\boldsymbol{\theta}}),y_{\tau,i})
\end{align*}
\section{Related Work}

Our contributions are closely related to the representation learning theory as well as continual learning methods.


\noindent\textbf{Representation learning theory.} The rise of deep learning motivated a growing interest in theoretical principles behind representation learning. Similar in spirit to this project, \cite{maurer2016benefit} provides generalization bounds for representation-based transfer learning in terms of the Rademacher complexities associated with the source and target tasks. Some of the earliest works towards this goal include \cite{baxter2000model}~and the linear settings of \cite{lounici2011oracle,pontil2013excess,wang2016distributed,cavallanti2010linear}. More recent works \cite{hanneke2020no,lu2021power,kong2020meta,qin2022non,wu2020understanding,garg2020functional,gulluk2021sample,xu2021statistical,du2020few,tripuraneni2020theory,tripuraneni2020provable,maurer2016benefit,arora2019theoretical,chen2021weighted,sun2021towards} consider variations beyond supervised learning, study concrete settings, or establish more refined upper/lower performance bounds. There is also a long line of works related to model agnostic meta-learning \cite{finn2017model,denevi2019online,balcan2019provable,khodak2019adaptive}. {Unlike these works, we consider the CL setting where tasks arrive sequentially and establish how the representations learned by earlier tasks help learning the new tasks with fewer samples and better accuracy.}

\noindent \textbf{Continual learning.}
A number of methods for continual and lifelong learning have been proposed to tackle the problem of catastrophic forgetting. Existing approaches can be broadly categorized into three groups: replay-based~\cite{lopez2017gradient,borsos2020coresets}, regularization-based~\cite{kirkpatrick2017overcoming, jung2020continual}, and architecture-based methods~\cite{yoon2017lifelong}. Recent work \cite{ramesh2021model} explores statistical challenges associated with continual learning in terms of the relatedness across tasks through replay-based strategies. In comparison, we focus on the benefits of representation learning and establish how the representation built for previous tasks can drastically reduce the sample complexity on new tasks in terms of representation mismatch without access to past data. Another key difference is that our analysis allows for learning multiple sequential tasks rather than a single task. In our work, consistent with our theory, we focus on zero-forgetting CL~\cite{mallya2018packnet,hung2019compacting,mallya2018piggyback,wortsman2020supermasks,kaushik2021understanding,tu2020extending}, which is a sub-branch of architecture-based methods and completely eliminates forgetting. \cite{hung2019compacting, kaushik2021understanding, mallya2018packnet} train a sub-network for each task and implement zero-forgetting by freezing the trained parameters, while \cite{mallya2018piggyback} trains only masks over a pretrained model. Inspired by \cite{ramanujan2020s}, SupSup \cite{wortsman2020supermasks} trains a binary mask for each task while keeping the underlying model fixed at initialization.
However, sufficiently large networks are needed in order to embed many tasks into the super-network or to achieve acceptable performance over a masked random network.
Additionally, without accounting for the network structure, the inference-time compute cost of these networks is high even for simple tasks. Our proposed method is designed to overcome these challenges. In addition to achieving zero-forgetting, our algorithm also provides efficient inference in terms of FLOPs.









\subsection{Linear regression}
Consider a linear feature matrix $\vct\Phi=[\vct \phi_1, \vct \phi_2,...,\vct \phi_r]\in\mathbb{R}^{d\times r}$ where $d\gg r$ and $\{\vct\phi_i\}_{i=1}^r$ are orthogonal. Given $\{(\vct{x}_{\tau,i},y_{\tau,i})\}_{i=1}^{n_\tau}\sim{\cal{D}}_\tau$, describe a linear regression problem as
\begin{align*}
    y=\vct{x}^\top\vct\beta+\epsilon~~\text{where}~~\vct\beta=\vct\Phi \vct{w}=\sum_{i=1}^r\vct{w}_i\vct\phi_i
\end{align*}
where $\vct{w}\in\mathbb{R}^r$ and $\epsilon$ is random noise. Therefore we have $\vct\beta\in\text{Span}\{\vct\phi_1,...,\vct\phi_r\}$. Define the linear continual regression problem as
\begin{align*}
    \tau\leq r:&\min_{\vct\phi}\|{\mtx{X}}_\tau\vct\phi-Y_\tau\|_2\\
    &\Longrightarrow\hat{\vct\phi}_\tau=({\mtx{X}}_\tau^\top {\mtx{X}}_\tau)^{-1}{\mtx{X}}_\tau^\top Y_\tau\\
    \tau > r:&\min_{\vct{w}}\|{\mtx{X}}_\tau\hat{\vct\Phi}\vct{w}-Y_\tau\|_2~~\text{where}~~\hat{\vct\Phi}=[\hat{\vct\phi}_1,...,\hat{\vct\phi}_r]\\
    &\Longrightarrow\hat{\vct{w}}_\tau =(({\mtx{X}}_\tau\hat{\vct\Phi})^\top{\mtx{X}}_\tau\hat{\vct\Phi})^{-1}({\mtx{X}}_\tau\hat{\vct\Phi})^\top Y_\tau
\end{align*}
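
For completeness, the closed form for $\tau>r$ can be recovered from the usual normal equations. The following is only a sketch: it assumes that ${\mtx{X}}_\tau\hat{\vct\Phi}$ has full column rank so that the inverse exists, and the shorthand ${\mtx{A}}_\tau$ is introduced here purely for readability.
\begin{align*}
{\mtx{A}}_\tau&:={\mtx{X}}_\tau\hat{\vct\Phi},\\
\nabla_{\vct{w}}\|{\mtx{A}}_\tau\vct{w}-Y_\tau\|_2^2&=2{\mtx{A}}_\tau^\top({\mtx{A}}_\tau\vct{w}-Y_\tau)=0\\
\Longrightarrow\quad{\mtx{A}}_\tau^\top{\mtx{A}}_\tau\vct{w}&={\mtx{A}}_\tau^\top Y_\tau
\quad\Longrightarrow\quad\hat{\vct{w}}_\tau=({\mtx{A}}_\tau^\top{\mtx{A}}_\tau)^{-1}{\mtx{A}}_\tau^\top Y_\tau.
\end{align*}
Minimizing the norm and minimizing the squared norm yield the same solution, so working with the squared objective loses nothing. Note that for $\tau>r$ only the $r$-dimensional vector $\vct{w}$ is estimated, rather than a full $d$-dimensional direction.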
\subsection{General case}
Consider a feature extractor $\phi\in\Phi$ and a task-specific function $h\in\mathcal{H}$. Denote the task predictor by $f:=h\circ\phi\in\mathcal{H}\times \Phi$, where $\mathcal{H}$ and $\Phi$ are continuous search spaces. Let $\hat\phi^M$ be the representation obtained by multitask learning:
\begin{align*}
    \hat\phi^M&=\arg\min_{\phi\in\Phi}\frac{1}{T}\sum_{\tau=1}^T{\cal{L}}(h_\tau\circ\phi(\vct{x}_\tau),y_\tau)\\
    &\text{s.t.}~~h_\tau=\arg\min_{h\in\mathcal{H}}\frac{1}{n_\tau}\sum_{i=1}^{n_\tau}{\cal{L}}(h\circ\phi(\vct{x}_{\tau,i}),y_{\tau,i})
\end{align*}
In practical problems, $\Phi$ is not uniformly distributed. Here, let $\Phi'=(\Phi_1\times\Phi_2\times...\times\Phi_T)$ and assume that for any $\phi\in\Phi$, there exists $\phi'\in\Phi'$ such that ${\cal{L}}(\phi')-{\cal{L}}(\phi)<\varepsilon$.

Define our continual learning problem as
\begin{align*}
    \hat\phi_1&=\arg\min_{\phi\in\Phi_1, h_1\in\mathcal{H}}{\cal{L}}_{{\cal{D}}_1}(h_1\circ\phi)\\
    \hat\phi_\tau&=\arg\min_{\phi\in(\hat\phi_1\times...\times\hat\phi_{\tau-1})\times\Phi_\tau, h_\tau\in\mathcal{H}}{\cal{L}}_{{\cal{D}}_\tau}(h_\tau\circ\phi)
\end{align*}
We will show that for all $\phi\in(\hat\phi_1\times...\times\hat\phi_{\tau-1})\times\Phi_\tau$
\begin{align*}
    Dist(\hat\phi^M,\phi) =\mathcal{O}()
\end{align*}
Next, consider individual learning, where
\begin{align*}
    \hat\phi^I_\tau=\arg\min_{\phi\in\Phi,h_\tau\in\mathcal{H}}{\cal{L}}(h_\tau\circ\phi)
\end{align*}
Then we will have
\begin{align*}
    Dist(\hat\phi^M,\hat\phi_\tau^I)=\mathcal{O}()
\end{align*}


\section{Experimental Evidence in CL}\label{sec:exp}















\subsection{Investigation of Data Efficiency}
To investigate how features learned from earlier tasks help in training new tasks in continual representation learning, we introduce a new experimental setting based on SplitCIFAR100 described in Sec.~\ref{sec:setting}, where 100 classes are randomly partitioned into 20 5-class classification tasks. Let us denote the original $20$ tasks as $\mathcal{T}_1,\dots,\mathcal{T}_{20}$. First, we split the tasks into a continual learning task set ${\cal{D}}_{cl}=\{\mathcal{T}_1,\dots,\mathcal{T}_{15}\}$ and a test task set ${\cal{D}}_t=\{\mathcal{T}_{16},\dots,\mathcal{T}_{20}\}$. Tasks in ${\cal{D}}_{cl}$ {retain} all $2500$ training samples per task. To test the data efficiency of CL, ${\cal{D}}_t$ tasks only have 10\% of their original training samples (i.e., $250$ for each task). We use the ResNet18 described in Sec.~\ref{sec:setting}. To quantify the benefits of learned feature representations in the continual learning process, we train the network sequentially using tasks in ${\cal{D}}_{cl}$ and test knowledge transferability separately over ${\cal{D}}_t$ after each task in ${\cal{D}}_{cl}$ is added to the network. \red{Our goal is to quantify the benefits of learned feature representations in the continual learning process. To this aim, we train the network using samples from test tasks in ${\cal{D}}_t$ after every task in ${\cal{D}}_{cl}$ and use the test accuracy of ${\cal{D}}_t$ as a proxy for quantifying the quality of representations learned in previous tasks}.
\input{sec/fig/cifar_oc}


First, we use the CL method to sequentially learn tasks ${\cal{D}}_{cl}$ and save the supernet after every task. In this manner, we get 16 different supernets $\vct{s}_0,\vct{s}_1,...,\vct{s}_{15}$, where $\vct{s}_0$ stands for the initial supernet with no task, and $\vct{s}_i$ stands for the supernet after training $\mathcal{T}_1,\dots,\mathcal{T}_i$ using the CL algorithm for $i\leq 15$. The source of the knowledge in $\vct{s}_i$ is the first $i$ tasks of the ${\cal{D}}_{cl}$ dataset. The test dataset ${\cal{D}}_t$ is used to test the transferability of the knowledge contained in supernet $\vct{s}_i$. {We separately train each of the 5 tasks of ${\cal{D}}_t$ by adding them into $\vct{s}_i$ in a continual learning manner (i.e.~by preserving the weights trained for ${\cal{D}}_{cl}$ and following the same channel pruning and weight allocation rules).} This is done for all $16$ supernets $\vct{s}_i$. Training at supernet $\vct{s}_0$ is equivalent to training the network from scratch. Training at supernet $\vct{s}_i$ shows how features learned from $\mathcal{T}_1,\dots,\mathcal{T}_i$ help the test tasks in terms of data efficiency. We plot the average test accuracy over $5$ trials in Fig.~\ref{fig:LS} for each $\vct{s}_i$, where the test accuracy in each trial is averaged over the $5$ test tasks.

Figure~\ref{fig:LS} summarizes the results of our experiment. The orange and blue curves display ESPN-1 and ESPN-0.2 results, respectively. We also run PackNet, SupSup, and Individual-0.2/-1 baselines using the same setting as in Table~\ref{CIFARtable}\som{ with the only difference being weight allocation $\alpha=0.05$ rather than $\alpha=0.1$}. Since SupSup uses a fixed model with random initialization and learns a mask for each task separately, no knowledge is transferred. This is why the dashed lines are strictly horizontal. Figure~\ref{fig:LS} shows that ESPN-0.2, ESPN-1, and PackNet benefit from features trained by earlier tasks, since the test accuracy significantly improves when we train the new task on top of a network containing more continual tasks. Observe that the performance of ESPN gradually increases with the growing number of continual tasks. This reveals its ability to successfully transfer knowledge from previously learned tasks. \ylm{We believe ESPN's advantage over PackNet arises from the task-specific BatchNorm weights, which allow ESPN to better re-purpose the features learned by earlier tasks. The small gap between ESPN-0.2 and ESPN-1 also shows that our ESPN algorithm performs well in finding the most relevant features despite channel sparsity restrictions.} Complementing these experiments, the next section provides theoretical insights into the representation learning ability of PackNet/ESPN.


\subsection{Importance of task order: diversity and sample size}

\input{sec/fig/sample_size}


\section{Appendix}
\subsection{Channel Pruning}
To prune channels effectively and iteratively, we introduce continuous-valued channel masks ($\mathcal{M}_l\in\mathbb{R}^{n_l}, l\in [L]$) to induce sparsity, where $L$ is the number of layers, each with $n_l$ channels/neurons. In our experiments, inspired by \cite{liu2017learning}, we adopt BatchNorm layers and use their weights as masks. Then we have
\begin{align*}
	\vct{x}_l=(\mathcal{M}_l\cdot \text{norm}(\text{conv}(\vct{x}_{l-1})))_+,~~\text{norm}({\vct{z}})=\frac{{\vct{z}}-\mu_{\vct{z}}}{\sqrt{\sigma_{\vct{z}}^2+\epsilon}}
\end{align*}
where $\vct{x}_l$ is the output feature of the $l^{th}$ layer and $\mathcal{M}_l$ is the trainable mask given by the weights of the BN layer.


Given a single task, structured pruning aims to find and remove insignificant channels/neurons without hurting performance. To achieve this goal, we propose a FLOP-based sparsity regularization
\begin{align*}
	\mathcal{R}(\mathcal{M})&=\sum_{l\in [L]}\lambda_l\|\mathcal{M}_l\|_1\\
	&=\sum_{l\in[L]}\frac{g(\text{FLOP}_l(\mathcal{M}))}{\sum_{j\in[L]}g(\text{FLOP}_j(\mathcal{M}))}\|\mathcal{M}_l\|_1
\end{align*}
where $\lambda_l$, $\mathcal{M}_l$ and $\text{FLOP}_l(\cdot)$ denote the regularization parameter, channel mask and FLOP calculation function of the $l^{th}$ layer. We apply the $\ell_1$-norm to achieve sparsity, since it drives unimportant elements to zero. $g(\cdot)$ is a monotonically increasing function used to calculate $\lambda_l$ automatically.
In deep neural networks, different layers have different FLOPs loads depending on their input/output feature sizes and operations. Therefore, it is unfair to use a single global regularization parameter. We adopt a layerwise regularization parameter $\lambda_l$ and relate it to the FLOPs load. Our goal is to push layers with more FLOPs to be sparser. Results show that this approach performs well, especially when the network is pruned to be very sparse. In our experiments, we use $g(x)=\sqrt{x}$.
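
As a small worked example of the layerwise weights, consider a hypothetical three-layer network whose layers currently cost $4$, $9$ and $36$ units of FLOPs. These numbers are made up for illustration and do not correspond to any layer of our ResNet18. With $g(x)=\sqrt{x}$ we obtain
\begin{align*}
&g(\text{FLOP}_1)=2,\quad g(\text{FLOP}_2)=3,\quad g(\text{FLOP}_3)=6,\qquad \sum_{j\in[3]}g(\text{FLOP}_j)=11,\\
&\lambda_1=\tfrac{2}{11},\quad \lambda_2=\tfrac{3}{11},\quad \lambda_3=\tfrac{6}{11}.
\end{align*}
The heaviest layer receives the largest penalty, but the square root tempers the ratio: layer 3 costs $9$ times more FLOPs than layer 1 yet is penalized only $3$ times as strongly. Since $\text{FLOP}_l(\mathcal{M})$ depends on the current masks, these weights change as channels are pruned.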


\begin{table}[t]
    \small
    \caption{\small{Continual learning results}}
    \centering
    \begin{tabular}{lcc}
    \midrule
    & RotatedMNIST & PermutedMNIST\\
    \midrule
    ESPN-0.2 &97.21$\pm$0.007 & 95.78$\pm$0.015\\
    PackNet &95.66$\pm$0.014 & 92.88$\pm$0.024\\
    Individual-0.2&95.48$\pm$0.370 & 95.36$\pm$0.303\\
    Individual-1 &97.36$\pm$0.149 & 97.31$\pm$0.107\\
    \midrule
    \end{tabular}\label{MNISTtable}
\end{table}





\section*{Organization of the Appendix}
Supplementary material is organized as follows.
\begin{enumerate}
\item Appendix~\ref{appsec: espnalgo} discusses additional details of our ESPN algorithm. In Appendix~\ref{pruning sec} we evaluate the benefits of our FLOP-aware pruning and weight allocation strategies.
\item Appendix~\ref{app:application} discusses an application of CRL: we introduce a shallow network by setting the feature extractor $\phi$ and classifier $h$ to be specific functions.
\item Appendix~\ref{app B} provides the proof of our Theorem \ref{cl thm} and also proves Theorem~\ref{cl thm3} proposed in Appendix~\ref{app:application}.
\item Appendix~\ref{app C} presents Theorem \ref{seq thm}, which provides guarantees for continual representation learning when the $T$ tasks are added into the super-network in a sequential fashion. This theorem adds further insights into the bias-variance tradeoffs surrounding sequential learning (e.g.~how representation mismatch might aggregate as we add more tasks).
\item Appendix~\ref{app:related} provides an expanded discussion of related work on continual learning, representation learning and neural network pruning.
\item Appendix~\ref{app:imagenet} describes the implementation setting of the experiment in Figure~\ref{fig:diversity}.
\end{enumerate}
}