
# Linear model
$$
\mathcal{B}=\operatorname{std}[\text{CF}]=K\langle X_1,X_2\rangle\frac{1}{\sqrt{d}}
$$

## Effect of task similarity

### MNIST semantic split
Split MNIST semantically into two tasks using one of the following digit groupings:

- "half"  : [0, 1, 2, 3, 4],[5, 6, 7, 8, 9]
- "round" : [0, 3, 6, 8, 9],[1, 2, 4, 5, 7]
- "top"   : [0, 2, 3, 8, 9],[1, 4, 5, 6, 7]
- "equal" : [0, 1, 2, 3, 4],[0, 1, 2, 3, 4]
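
The groupings above can be turned into two task datasets with a small helper. This is a minimal sketch: the names `SPLITS` and `split_tasks` are hypothetical, and it assumes the images and integer labels are already loaded as NumPy arrays.

```python
import numpy as np

# Digit groupings copied from the list above:
# each entry is (digits of task 1, digits of task 2).
SPLITS = {
    "half":  ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]),
    "round": ([0, 3, 6, 8, 9], [1, 2, 4, 5, 7]),
    "top":   ([0, 2, 3, 8, 9], [1, 4, 5, 6, 7]),
    "equal": ([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]),
}


def split_tasks(images, labels, name):
    """Return ((X1, y1), (X2, y2)) for the grouping called `name`.

    `images` has shape (n, ...) and `labels` has shape (n,).
    """
    first, second = SPLITS[name]
    mask1 = np.isin(labels, first)    # samples whose digit belongs to task 1
    mask2 = np.isin(labels, second)   # samples whose digit belongs to task 2
    return (images[mask1], labels[mask1]), (images[mask2], labels[mask2])
```

Note that with the "equal" grouping both tasks receive the same samples, and digits 5 to 9 are dropped.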

However, the similarity computed with the Frobenius product barely changes across these groupings, and therefore neither does the theoretical forgetting. This was validated experimentally.

![alt text](image.png)
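
One way to compute a Frobenius-product similarity between two tasks is sketched below. Two points are assumptions rather than something stated in these notes: the product is taken between the second-moment matrices of the task inputs, and it is normalized by the Frobenius norms so that the value for non-zero inputs lies in $[0, 1]$. The function names are hypothetical.

```python
import numpy as np


def second_moment(X):
    """Uncentered second-moment matrix X^T X / n for inputs X of shape (n, d)."""
    return X.T @ X / X.shape[0]


def frobenius_similarity(A, B):
    """Frobenius inner product <A, B>_F divided by ||A||_F * ||B||_F.

    The normalization is an assumption of this sketch.
    """
    inner = np.sum(A * B)  # <A, B>_F = sum_ij A_ij * B_ij
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return inner / (np.linalg.norm(A) * np.linalg.norm(B))


# Example with random stand-in data (not MNIST): two tasks with 784 input features.
rng = np.random.default_rng(0)
X1 = rng.random((100, 784))
X2 = rng.random((120, 784))
sim = frobenius_similarity(second_moment(X1), second_moment(X2))
```

Using second-moment matrices lets the two tasks have different sample counts, since both matrices have shape `(d, d)`.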

Also validated on permuted MNIST (similarity $\sim \alpha^{-1}$).

![alt text](image-1.png)
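
For reference, a permuted-MNIST task is usually built by applying one fixed random pixel permutation to every flattened image. The sketch below shows that standard construction only; the function name `make_permuted_task` is hypothetical, and the parameter $\alpha$ from these notes is not modeled here.

```python
import numpy as np


def make_permuted_task(images, seed):
    """Apply one fixed random pixel permutation to all images.

    `images` has shape (n, ...); the result is flattened to shape (n, d).
    Returns the permuted images and the permutation that was used.
    """
    rng = np.random.default_rng(seed)
    flat = images.reshape(images.shape[0], -1)  # (n, d)
    perm = rng.permutation(flat.shape[1])       # same permutation for every image
    return flat[:, perm], perm
```

Each seed defines a new task with the same labels and the same pixel values, only rearranged.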

What is the limit case for $d=1$?

#### Next steps
- Derive an analytical expression for ReLU
- Add gradient-based forgetting


# Non-linear model

- other activations (GELU, ELU, ...)
- other datasets (permuted MNIST, CIFAR10)
- datasets beyond image classification


# Other things

- extend to 3 consecutive tasks
- find realistic example where tasks are very different
- normal training switch (no teacher-student)
- training the heads as well

