# Formal Derivations {#appendix_theory}

## General

-   $n_t$: number of samples of task $t$

-   $\textbf{X}_t$: task batch $\in \mathbb{R}^{n_t\times m}$

-   $X_t$: task sample $\in \mathbb{R}^{1\times m}$

-   $d_l$: hidden dimension of layer $l$

-   $o$: output dimension (number of classes)

-   $\sigma$: ReLU activation function, $\sigma(z):=\max{(0,z)}$

-   $\Theta$: hidden layer $\in \mathbb{R}^{m\times d}$

-   $\vartheta_t$: output head (for task $t$) $\in \mathbb{R}^{d\times o}$, frozen
    and initialized
    $\vartheta_t\overset{\text{iid}}{\sim}\mathcal{N}(0,\frac{1}{d})$

-   $z_t$: pre-activations $z_t=X_t\Theta \in \mathbb{R}^{1\times d}$

-   $h_t$: activations $h_t=\sigma(z_t)\in \mathbb{R}^{1\times d}$

-   $f(X_t)$: logits $f(X_t) = h_t\vartheta_t$

-   $E_t$: residuals $E_t:=f(X_t)-Y_t\in \mathbb{R}^{1\times o}$

-   $\delta_t$: logit residuals
    $\delta_t:=\text{Softmax}(f(X_t))-Y_t\in \mathbb{R}^{1\times o}$

-   $\mathcal{L}_t$: loss of task $t$

-   $\mathcal{L}^{(i)}_t$: contribution of the $i^{th}$ sample to the loss of task $t$

## Mathematical Identities

#### Frobenius product rank one identity 

For column vectors $A, C$ of one common length and $B, D$ of another:

$$\label{frob_id_1}
     \langle AB^\top, CD^\top\rangle_F=\text{Tr}(BA^\top CD^\top)=(A^\top C)(B^\top D)= \langle A,C\rangle_F \langle B,D\rangle_F$$
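As a quick numerical sanity check of the rank-one identity, the following sketch compares both sides for random vectors. The sizes `p` and `q`, the seed, and the helper names are illustrative choices, not part of the derivation; only the Python standard library is used.

```python
import random

def outer(u, v):
    """Outer product u v^T of two vectors given as flat lists."""
    return [[ui * vj for vj in v] for ui in u]

def frobenius(M, N):
    """Frobenius inner product: sum of elementwise products."""
    return sum(a * b for row_m, row_n in zip(M, N) for a, b in zip(row_m, row_n))

def dot(u, v):
    """Euclidean inner product of two flat lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

rng = random.Random(0)  # fixed seed, arbitrary
p, q = 4, 3             # hypothetical vector sizes
A = [rng.gauss(0, 1) for _ in range(p)]
C = [rng.gauss(0, 1) for _ in range(p)]
B = [rng.gauss(0, 1) for _ in range(q)]
D = [rng.gauss(0, 1) for _ in range(q)]

lhs = frobenius(outer(A, B), outer(C, D))  # <A B^T, C D^T>_F
rhs = dot(A, C) * dot(B, D)                # <A, C> <B, D>
print("absolute difference:", abs(lhs - rhs))
```

The two sides should agree up to floating-point rounding.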

#### Trace

$$\label{eq:tr_identity_1}
    \operatorname{Tr}(AB)=\operatorname{Tr}(BA)$$

#### Vec

$$\Theta\in\mathbb{R}^{m\times d}\to\theta:=\operatorname{vec}(\Theta)\in \mathbb{R}^{1\times md}$$

#### Kronecker product

$$\label{eq:kronecker_identity_1}
    \underbrace{a}_{1\times n} \otimes_\text{Kr} \underbrace{b}_{1\times m} = \underbrace{\operatorname{vec}(b^\top a)}_{1\times (nm)}$$

The "vec trick":

$$\label{eq:kronecker_identity_2}
(A \otimes_\text{Kr} B)\operatorname{vec}(C)=\operatorname{vec}(BCA^\top)$$
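Both Kronecker identities depend on $\operatorname{vec}$ stacking columns (column-major order). The sketch below checks them numerically under that assumption; the matrix sizes, the seed, and the helper functions are hypothetical and written with the Python standard library only.

```python
import random

def matmul(P, Q):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def vec(M):
    """Column-major vectorization: stack the columns of M into a flat list."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]

def kron(P, Q):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[p * q for p in row_p for q in row_q]
            for row_p in P for row_q in Q]

def rand_matrix(rows, cols, rng):
    """Matrix with independent standard normal entries."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Identity 1: a (1 x n) kron b (1 x m) equals vec(b^T a).
a, b = rand_matrix(1, 4, rng), rand_matrix(1, 3, rng)
lhs1 = kron(a, b)[0]
rhs1 = vec(matmul(transpose(b), a))
err1 = max(abs(x - y) for x, y in zip(lhs1, rhs1))

# Identity 2 (vec trick): (A kron B) vec(C) equals vec(B C A^T).
A, B, C = rand_matrix(2, 5, rng), rand_matrix(3, 4, rng), rand_matrix(4, 5, rng)
col = [[c] for c in vec(C)]  # vec(C) as a column matrix
lhs2 = [row[0] for row in matmul(kron(A, B), col)]
rhs2 = vec(matmul(matmul(B, C), transpose(A)))
err2 = max(abs(x - y) for x, y in zip(lhs2, rhs2))
print("identity 1 error:", err1, "identity 2 error:", err2)
```

With a row-major `vec`, the roles of the two factors in each identity would be swapped, so the ordering convention matters when implementing these formulas.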

## Definitions

### Forgetting

$$\label{eq:cf_definition}
    \text{CF} := \langle \nabla_{\Theta} \mathcal{L}_1, \nabla_{\Theta} \mathcal{L}_{2} \rangle_F$$

### Mean Squared Error (MSE)

$$\mathcal{L}_t = \frac{1}{n_t}\sum_{i=1}^{n_t} \mathcal{L}^{(i)}_t$$

$$\mathcal{L}^{(i)}_t=\frac{1}{2}\Vert E_t^{(i)} \Vert^2=\frac{1}{2}(f(X^{(i)}_t)-Y^{(i)}_t)(f(X^{(i)}_t)-Y^{(i)}_t)^\top$$

### Cross-Entropy

$$\mathcal{L}_t^{(i)}=-\sum_{c=1}^o Y_{t,c}^{(i)}\log\Big(\text{Softmax}\big(f_t(X_t^{(i)})\big)_c\Big)$$

### Loss Gradient

Applying the chain rule on a generic loss $\mathcal{L}$, the vectorized
gradient w.r.t. the layer parameter $\Theta$ can be separated into a
loss-dependent and model-dependent component.

$$\nabla_\Theta \mathcal{L}=\underbrace{\nabla_f \mathcal{L}}_{\text{loss}^\dagger}\cdot \underbrace{J_f(\Theta)}_{\text{model}^\ddagger}$$

-   $\dagger$: since $\mathcal{L}$ is a scalar-valued function, this term is the
    gradient w.r.t. the model output $f$,
    $\nabla_f \mathcal{L}\in\mathbb{R}^{1\times o}$

-   $\ddagger$: $J_f(\Theta)$ is the Jacobian of the model output w.r.t. the parameter
    $\Theta\in\mathbb{R}^{m\times d}$. The model $f$ is generally a
    vector-valued function if $o>1$. Therefore, it is a Jacobian rather
    than a gradient and $J_f(\Theta)\in \mathbb{R}^{o\times m\times d}$

In the vectorized form $\theta=\operatorname{vec}(\Theta)$:

$$\nabla_\theta\mathcal{L}=\underbrace{\nabla_f\mathcal{L}}_{1\times o}\cdot \underbrace{J_f(\theta)}_{o\times md}$$

The term $\dagger$ can be calculated independently of the model
considered.

$$\nabla_f \mathcal{L}=\begin{cases}
        E_t:= f_t(X_t^{(i)})-Y_t^{(i)} & \text{MSE} \\
        \delta_t:= \text{Softmax}\big(f_t(X_t^{(i)})\big)-Y_t^{(i)} & \text{Cross-Entropy}
    \end{cases}$$
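The two cases of $\nabla_f \mathcal{L}$ can be checked against central finite differences. The sketch below does this for a single sample with a one-hot target; the logits, the target, and the step size `eps` are made-up illustrative values.

```python
import math

def softmax(f):
    """Numerically stable softmax of a list of logits."""
    mx = max(f)
    exps = [math.exp(v - mx) for v in f]
    total = sum(exps)
    return [v / total for v in exps]

def mse_loss(f, y):
    """Per-sample MSE loss, 0.5 * ||f - y||^2."""
    return 0.5 * sum((fi - yi) ** 2 for fi, yi in zip(f, y))

def ce_loss(f, y):
    """Per-sample cross-entropy loss on softmax probabilities."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(f)))

def numerical_grad(loss, f, y, eps=1e-6):
    """Central finite differences of the loss w.r.t. each logit."""
    grad = []
    for c in range(len(f)):
        up = f[:c] + [f[c] + eps] + f[c + 1:]
        down = f[:c] + [f[c] - eps] + f[c + 1:]
        grad.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grad

f = [0.3, -1.2, 2.0]  # hypothetical logits, o = 3
y = [0.0, 1.0, 0.0]   # one-hot target

E = [fi - yi for fi, yi in zip(f, y)]               # MSE residual
delta = [pi - yi for pi, yi in zip(softmax(f), y)]  # cross-entropy residual

err_mse = max(abs(g - e) for g, e in zip(numerical_grad(mse_loss, f, y), E))
err_ce = max(abs(g - r) for g, r in zip(numerical_grad(ce_loss, f, y), delta))
print("MSE error:", err_mse, "cross-entropy error:", err_ce)
```

The cross-entropy case relies on the target summing to one; for unnormalized targets the gradient would pick up an extra factor.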

In the next derivations, $E_t$ and $\delta_t$ can be interchanged
depending on the loss function.

To further simplify our analysis, we set $o=1$ and write $e$ for the
scalar residual of the loss considered ($E_t$ or $\delta_t$). The
Jacobian then reduces to a row vector and can be replaced by the
gradient $\nabla_\theta f$

$$\label{eq:loss_decomposition}
    \nabla_\theta\mathcal{L}=e\nabla_\theta f$$

In the following subsections, $\nabla_\theta f$ will be calculated for
different models $f$.

## Linear Model

$$f(X)=z\vartheta = X\Theta\vartheta$$

The gradient for Equation
[\[eq:loss_decomposition\]](#eq:loss_decomposition){reference-type="ref"
reference="eq:loss_decomposition"} is decomposed by the chain rule.

$$\nabla_\theta f =\nabla_z f\cdot J_z(\theta)$$

$$\begin{aligned}
    & \nabla_z f = \vartheta^\top \\
    & J_{z}(\theta)=(I_d \otimes_\text{Kr} X)
\end{aligned}$$

$\otimes_\text{Kr}$ is the Kronecker product of two matrices:
$(k\times j) \otimes_\text{Kr} (m\times n)\to (km \times jn)$.

Substituting both terms into the chain rule gives the gradient.

$$\begin{aligned}
    \nabla_\theta f &=\vartheta^\top \otimes_\text{Kr} X\\
    &= \operatorname{vec}(X^\top\vartheta^\top)
\end{aligned}$$

The gradient is reshaped back to its original shape ($m\times d$) by the
$\text{unvec}$ operation.

$$\nabla_\Theta f=\operatorname{unvec}(\nabla_\theta f)=X^\top\vartheta^\top$$
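The closed form $\nabla_\Theta f = X^\top\vartheta^\top$ can be verified numerically. The following sketch, with hypothetical sizes `m` and `d` and a scalar output ($o=1$), compares it entry by entry against central finite differences.

```python
import random

rng = random.Random(0)
m, d = 3, 4  # hypothetical input and hidden sizes
X = [rng.gauss(0, 1) for _ in range(m)]                          # sample, 1 x m
Theta = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # layer, m x d
head = [rng.gauss(0, d ** -0.5) for _ in range(d)]               # head, variance 1/d

def f(Theta):
    """Linear model with scalar output: f = X Theta head."""
    z = [sum(X[i] * Theta[i][j] for i in range(m)) for j in range(d)]
    return sum(z[j] * head[j] for j in range(d))

# Analytic gradient X^T head^T: entry (i, j) is X_i * head_j.
analytic = [[X[i] * head[j] for j in range(d)] for i in range(m)]

# Central finite differences, perturbing one entry of Theta at a time.
eps = 1e-6
numeric = [[0.0] * d for _ in range(m)]
for i in range(m):
    for j in range(d):
        Theta[i][j] += eps
        up = f(Theta)
        Theta[i][j] -= 2 * eps
        down = f(Theta)
        Theta[i][j] += eps  # restore the original value
        numeric[i][j] = (up - down) / (2 * eps)

max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(m) for j in range(d))
print("max abs error:", max_err)
```

Because $f$ is linear in $\Theta$, the finite-difference estimate is exact up to rounding, so any sizeable error would point to a mistake in the formula rather than in the step size.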

Substitute in Equation
[\[eq:loss_decomposition\]](#eq:loss_decomposition){reference-type="ref"
reference="eq:loss_decomposition"},

$$\begin{aligned}
    \nabla_{\Theta}\mathcal{L}_t&=eX^\top \vartheta^\top\\
    &=X^\top e\vartheta^\top\\
\end{aligned}$$

It is now possible to calculate $G$, the inner product of the two task
gradients that defines CF. For this, it is necessary to introduce the
subscripts identifying different tasks $t$, $t'$ into the notation.

$$\begin{aligned}
G &= \langle \nabla_{\Theta} \mathcal{L}_t, \nabla_{\Theta} \mathcal{L}_{t'} \rangle_F \\
&=\langle X_t^\top e_t\vartheta_t^\top, X_{t'}^\top e_{t'}\vartheta_{t'}^\top\ \rangle_F\\
&=\underbrace{e_t e_{t'}}_K\langle X_t^\top \vartheta_t^\top, X_{t'}^\top \vartheta_{t'}^\top\ \rangle_F\\
&=K\langle X_t,X_{t'}\rangle \langle \vartheta_t,\vartheta_{t'}\rangle && \text{using the identity \ref{frob_id_1}}
\end{aligned}$$

In expectation, $G$ is zero, since the heads $\vartheta_t$ and
$\vartheta_{t'}$ are independent with
$\vartheta_t\overset{\text{iid}}{\sim}\mathcal{N}(0,\frac{1}{d})$, but
the variance is non-zero.

$$\begin{split}
    \text{Var}[G]&=\mathbb{E}[G^2]-\mathbb{E}[G]^2
    \\
    &=\mathbb{E}[G^2]
    \\
    &=K^2\langle X_t,X_{t'}\rangle^2 \mathbb{E}[\langle \vartheta_{t},\vartheta_{t'}\rangle^2]
    \\
    &=K^2\langle X_t,X_{t'}\rangle^2 \text{Var}[\langle \vartheta_t,\vartheta_{t'}\rangle]
    \\
    &=K^2\langle X_t,X_{t'}\rangle^2 \sum_{j=1}^d \operatorname{Var}[\vartheta_t^{(j)}] \operatorname{Var}[\vartheta_{t'}^{(j)}]
    \\
    &=K^2\langle X_t,X_{t'}\rangle^2\frac{1}{d^2}d\\
    &=K^2\langle X_t,X_{t'}\rangle^2\frac{1}{d}
\end{split}$$

Since $G$ has zero mean, its standard deviation sets the scale of its
fluctuations and is used as the upper bound for catastrophic forgetting.
$$\mathcal{B}=\operatorname{std}[\text{CF}]=|K|\,|\langle X_t,X_{t'}\rangle|\frac{1}{\sqrt{d}}$$

$$\boxed{
    \text{CF}\leq \mathcal{B} = |\langle X_t,X_{t'}\rangle|\frac{|K|}{\sqrt{d}} \ \text{w.h.p.}
    }$$
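The $1/d$ scaling of the variance can be checked with a small Monte Carlo simulation. In the sketch below, $K$ and $\langle X_t, X_{t'}\rangle$ are held fixed at arbitrary values and only the two heads are resampled, which mirrors how the derivation treats them; the hidden size, sample count, and seed are hypothetical.

```python
import random

rng = random.Random(0)
d, n_trials = 8, 20000  # hypothetical hidden size and sample count
K, x_inner = 0.7, 1.5   # arbitrary fixed values of K and <X_t, X_t'>
sigma = d ** -0.5       # std of each head entry, i.e. variance 1/d

samples = []
for _ in range(n_trials):
    head_t = [rng.gauss(0, sigma) for _ in range(d)]
    head_tp = [rng.gauss(0, sigma) for _ in range(d)]
    head_inner = sum(a * b for a, b in zip(head_t, head_tp))
    samples.append(K * x_inner * head_inner)  # one draw of G

mean = sum(samples) / n_trials
empirical_var = sum((g - mean) ** 2 for g in samples) / n_trials
predicted_var = K ** 2 * x_inner ** 2 / d  # K^2 <X_t, X_t'>^2 / d

print("empirical mean:", mean)
print("empirical variance:", empirical_var, "predicted:", predicted_var)
```

The empirical variance should land close to the prediction, with the gap shrinking as `n_trials` grows. Note that the simulation keeps $K$ constant, so it does not capture any dependence of the residuals on the heads.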

## One-Layer MLP

Experimentally, forgetting is worse than in the linear case. Unfortunately, the variance calculation does not describe the experiments well, since the variance it predicts decreases relative to the linear case.

Since there is only one hidden layer, the subscript to distinguish the
layers is omitted. The mapping $f$ can be written equivalently for the
task batch $\textbf{X}_t$ or for a single sample $X_t$.

$$\begin{split}
    f(X_t)&=h_t\vartheta_t\\
    &=\sigma(z_t)\vartheta_t\\
    &=\sigma(X_t\Theta)\vartheta_t
\end{split}$$

The model term $\ddagger$ for Equation
[\[eq:loss_decomposition\]](#eq:loss_decomposition){reference-type="ref"
reference="eq:loss_decomposition"}, here the gradient $\nabla_\theta f$,
is decomposed by applying the chain rule with respect to the vectorized
parameter $\theta=\operatorname{vec}(\Theta)$

$$\label{eq:J_oneMLP_decomposition}
    \nabla_\theta f=
    \underbrace{\nabla_h f}_{\in\mathbb{R}^{1\times d}}
    \cdot
    \underbrace{J_{h}(z)}_{\in\mathbb{R}^{d\times d}}
    \cdot
    \underbrace{J_z(\theta)}_{\in \mathbb{R}^{d \times md}}$$

Each component is calculated as follows.

$$\label{eq:J_oneMLP_components}
\begin{split}
&\nabla_h f = \vartheta_t^\top \\
&J_h(z) = D_t \\
&J_z(\theta) = I_d \otimes_\text{Kr} X_t
\end{split}$$

$D_t$ is a $d\times d$ diagonal matrix whose diagonal elements are the
*gates*, i.e. the derivatives of the ReLU activations $\sigma'(z_t)$

$$g_t^{(j)} := \sigma'(z_t^{(j)})=
    \begin{cases}
    1, & z_t^{(j)}>0\\
    0, & \text{else}
    \end{cases}$$

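Since the gates are binary, $D_t$ is idempotent:
$$D_t^2 = D_tD_t^\top = D_t$$

The Kronecker product $I_d \otimes_\text{Kr} X_t$ produces a
$d\times md$ matrix whose entry in row $j$ and column $(i,k)$, the
column associated with $\Theta_{ik}$, is $\delta_{jk}$ times the $i$-th
feature of $X_t$. Here $\delta_{jk}$ denotes the Kronecker delta, not
the logit residual $\delta_t$.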

Plugging
[\[eq:J_oneMLP_components\]](#eq:J_oneMLP_components){reference-type="ref"
reference="eq:J_oneMLP_components"} into
[\[eq:J_oneMLP_decomposition\]](#eq:J_oneMLP_decomposition){reference-type="ref"
reference="eq:J_oneMLP_decomposition"} gives

$$\begin{aligned}
    \nabla_\theta f &= (\vartheta_t^\top D_t) \otimes_\text{Kr} X_t\\
    &=\operatorname{vec}(X_t^\top(\vartheta_t^\top D_t)) && \text{by \ref{eq:kronecker_identity_1}}
\end{aligned}$$

The gradient in its original shape ($m\times d$) is retrieved using the
$\operatorname{unvec}$ operation.

$$\begin{split}
    \nabla_\Theta f = \operatorname{unvec}(\nabla_\theta f) &= X_t^\top\vartheta_t^\top D_t
\end{split}$$
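The gated gradient $X_t^\top\vartheta_t^\top D_t$ can be checked against finite differences in the same way as for a linear model. The sketch below assumes a scalar output ($o=1$) and hypothetical sizes; it also assumes no pre-activation sits within the step size of zero, where the ReLU is not differentiable.

```python
import random

rng = random.Random(0)
m, d = 3, 5  # hypothetical input and hidden sizes
X = [rng.gauss(0, 1) for _ in range(m)]                          # sample, 1 x m
Theta = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # layer, m x d
head = [rng.gauss(0, d ** -0.5) for _ in range(d)]               # head, variance 1/d

def pre_activations(Theta):
    """z = X Theta, a list of d pre-activations."""
    return [sum(X[i] * Theta[i][j] for i in range(m)) for j in range(d)]

def f(Theta):
    """One-hidden-layer ReLU model with scalar output."""
    h = [max(0.0, zj) for zj in pre_activations(Theta)]
    return sum(hj * vj for hj, vj in zip(h, head))

# Gates: diagonal of D, one if the unit is active and zero otherwise.
gates = [1.0 if zj > 0 else 0.0 for zj in pre_activations(Theta)]

# Analytic gradient X^T head^T D: entry (i, j) is X_i * head_j * gate_j.
analytic = [[X[i] * head[j] * gates[j] for j in range(d)] for i in range(m)]

# Central finite differences, perturbing one entry of Theta at a time.
eps = 1e-6
numeric = [[0.0] * d for _ in range(m)]
for i in range(m):
    for j in range(d):
        Theta[i][j] += eps
        up = f(Theta)
        Theta[i][j] -= 2 * eps
        down = f(Theta)
        Theta[i][j] += eps  # restore the original value
        numeric[i][j] = (up - down) / (2 * eps)

max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(m) for j in range(d))
print("gates:", gates, "max abs error:", max_err)
```

Columns of the gradient that belong to inactive units are exactly zero, which is the effect of multiplying by $D_t$.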

Substitute in
[\[eq:loss_decomposition\]](#eq:loss_decomposition){reference-type="ref"
reference="eq:loss_decomposition"}

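$$\label{eq:loss_grad_oneMLP_compact_S}
\begin{aligned}
    \nabla_\Theta\mathcal{L}_t &= e_t X_t^\top\vartheta_t^\top D_t\\
    &= X_t^\top S_t, \qquad S_t := e_t\vartheta_t^\top D_t\in\mathbb{R}^{1\times d}
\end{aligned}$$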

$$\begin{aligned}
G &= \langle \nabla_\Theta \mathcal{L}_t, \nabla_\Theta \mathcal{L}_{t'} \rangle_F\\
&= e_t e_{t'}\langle X_t,X_{t'} \rangle\, \vartheta_t^\top D_t D_{t'}^\top\vartheta_{t'}
\end{aligned}$$
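The factored form of $G$ shows that only hidden units active for both tasks contribute. The sketch below compares the Frobenius product of the two full gradients with the factored expression; the residuals `e_t` and `e_tp` are random stand-in scalars rather than values computed from targets, and all sizes are hypothetical.

```python
import random

rng = random.Random(1)
m, d = 3, 6  # hypothetical input and hidden sizes
Theta = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # shared layer

def sample_task():
    """Draw a sample, a frozen head, and a stand-in scalar residual."""
    X = [rng.gauss(0, 1) for _ in range(m)]
    head = [rng.gauss(0, d ** -0.5) for _ in range(d)]
    e = rng.gauss(0, 1)  # assumption: residual treated as a free scalar
    return X, head, e

def loss_gradient(X, head, e):
    """Gradient e * X^T head^T D as an m x d matrix, plus the gates."""
    z = [sum(X[i] * Theta[i][j] for i in range(m)) for j in range(d)]
    gates = [1.0 if zj > 0 else 0.0 for zj in z]
    grad = [[e * X[i] * head[j] * gates[j] for j in range(d)] for i in range(m)]
    return grad, gates

X_t, head_t, e_t = sample_task()
X_tp, head_tp, e_tp = sample_task()
grad_t, gates_t = loss_gradient(X_t, head_t, e_t)
grad_tp, gates_tp = loss_gradient(X_tp, head_tp, e_tp)

# Direct Frobenius inner product of the two gradient matrices.
G_direct = sum(a * b for ra, rb in zip(grad_t, grad_tp) for a, b in zip(ra, rb))

# Factored form: e_t e_t' <X_t, X_t'> head_t^T D_t D_t' head_t'.
x_inner = sum(a * b for a, b in zip(X_t, X_tp))
gated_head_inner = sum(head_t[j] * gates_t[j] * gates_tp[j] * head_tp[j]
                       for j in range(d))
G_factored = e_t * e_tp * x_inner * gated_head_inner
print("direct:", G_direct, "factored:", G_factored)
```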


## What did not work: Using the variance

Inserting
[\[eq:loss_grad_oneMLP_compact_S\]](#eq:loss_grad_oneMLP_compact_S){reference-type="ref"
reference="eq:loss_grad_oneMLP_compact_S"} into
[\[eq:cf_definition\]](#eq:cf_definition){reference-type="ref"
reference="eq:cf_definition"}, the CF is obtained.

$$\begin{aligned}
    G &:= \langle \nabla_{\Theta} \mathcal{L}_t, \nabla_{\Theta} \mathcal{L}_{t'} \rangle_F\\
    &= \langle X_t, X_{t'}\rangle \cdot\langle S_t, S_{t'}\rangle_F \\
    &= \langle X_t, X_{t'}\rangle \cdot\text{Tr}( S_t S_{t'}^\top) \label{cf_MLP_der}
\end{aligned}$$

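$$\begin{aligned}
    G^2 &=  \langle X_t, X_{t'}\rangle^2 \cdot\text{Tr}( S_t S_{t'}^\top)^2\\
    &= \langle X_t, X_{t'}\rangle^2 \cdot \Big(\sum _{i=1}^d S_t^{(i)} S_{t'}^{(i)}\Big)^2
\end{aligned}$$

When taking the expectation, the cross terms $i\neq j$ of the squared
sum vanish, because the entries of $\vartheta_t$ are independent with
zero mean (given the independence assumptions used in this section).
Only $\sum_{i=1}^d \big(S_t^{(i)}\big)^2 \big(S_{t'}^{(i)}\big)^2$
contributes.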

$$\begin{aligned}
    \mathbb{E}\Big[ \big(S_t^{(i)}\big)^2  \Big] &= \mathbb{E}\Big[ e_t^2 \big(g_t^{(i)}\big)^2 \big(\vartheta_t^{(i)}\big)^2  \Big]\\
    &= e_t^2\mathbb{E}\Big[  \big(g_t^{(i)}\big)^2\Big] \mathbb{E}\Big[ \big(\vartheta_t^{(i)}\big)^2  \Big] && \text{assuming } g_t \perp\vartheta_t\\
    &= e_t^2\mathbb{E}\Big[  g_t^{(i)}\Big] \operatorname{Var}\Big[ \vartheta_t^{(i)}\Big] && \text{since } g_t^{(i)}\in\{0,1\}\\
    &=e_t^2 \frac{1}{2}\frac{1}{d} && \text{assuming } \mathbb{E}\big[g_t^{(i)}\big]=\tfrac{1}{2}
\end{aligned}$$

$$\begin{aligned}
\label{eq:var_MLP_der}
    \operatorname{Var}[G] &= \mathbb{E}[G^2]-\mathbb{E}[G]^2\\
    &= \mathbb{E}[G^2] \\
    &= \mathbb{E}\Big[ \langle X_t, X_{t'}\rangle^2 \cdot \Big(\sum _{i=1}^d S_t^{(i)} S_{t'}^{(i)}\Big)^2\Big] \\
    &=\langle X_t, X_{t'}\rangle^2 \cdot \sum _{i=1}^d \mathbb{E}\Big[\big(S_t^{(i)}\big)^2\Big] \mathbb{E}\Big[\big(S_{t'}^{(i)}\big)^2\Big] && \text{assuming } S_t\perp S_{t'} \text{; cross terms have zero mean}\\
    &= \langle X_t, X_{t'}\rangle^2 \cdot \sum _{i=1}^d e_t^2\frac{1}{2d}e_{t'}^2\frac{1}{2d}\\
    &= \langle X_t, X_{t'}\rangle^2 \underbrace{e_t^2 e_{t'}^2}_{K^2}\frac{1}{4d^2} \sum _{i=1}^d 1 \\
    &= \langle X_t, X_{t'}\rangle^2 K^2 \frac{1}{4d}
\end{aligned}$$

$$\begin{aligned}
    \mathcal{B}:=\operatorname{std}[G]&=\sqrt{\operatorname{Var}[G]}\\
    &=|\langle X_t, X_{t'}\rangle| \frac{|K|}{2\sqrt{d}}
\end{aligned}$$

$$\begin{aligned}
    \boxed{
    \text{CF}\leq \mathcal{B} = |\langle X_t, X_{t'}\rangle| \frac{|K|}{2\sqrt{d}} \ \text{ w.h.p.}
    }
\end{aligned}$$
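The arithmetic behind the $\frac{1}{4d}$ factor can be reproduced by simulation, provided the same independence assumptions are imposed by construction. In the sketch below the gates are drawn as independent Bernoulli($\frac{1}{2}$) variables rather than from actual ReLU pre-activations, and the residuals and $\langle X_t, X_{t'}\rangle$ are fixed arbitrary constants. It therefore checks the calculation, not whether the assumptions hold in a trained network.

```python
import random

rng = random.Random(0)
d, n_trials = 8, 40000               # hypothetical hidden size and sample count
e_t, e_tp, x_inner = 0.9, -1.1, 1.5  # arbitrary fixed residuals and <X_t, X_t'>
sigma = d ** -0.5                    # std of each head entry, i.e. variance 1/d

def sample_S(e):
    """One draw of S = e * gate * head, with gates independent of the head."""
    S = []
    for _ in range(d):
        gate = 1.0 if rng.random() < 0.5 else 0.0  # assumption: E[gate] = 1/2
        S.append(e * gate * rng.gauss(0, sigma))
    return S

samples = []
for _ in range(n_trials):
    S_t, S_tp = sample_S(e_t), sample_S(e_tp)
    samples.append(x_inner * sum(a * b for a, b in zip(S_t, S_tp)))  # one draw of G

mean = sum(samples) / n_trials
empirical_var = sum((g - mean) ** 2 for g in samples) / n_trials
K = e_t * e_tp
predicted_var = x_inner ** 2 * K ** 2 / (4 * d)

print("empirical variance:", empirical_var, "predicted:", predicted_var)
```

Under these sampling assumptions the empirical variance should match the prediction closely, which suggests that any mismatch with experiments comes from the assumptions themselves, for example gates that are correlated across tasks through the shared layer $\Theta$.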
