\section{Setup}

\subsection{Dataset and Task Definition}

We conduct our experiments on the MNIST dataset~\cite{lecun1998mnist}, which consists of $28 \times 28$ grayscale images of handwritten digits (0--9). The dataset is partitioned into two distinct tasks to study catastrophic forgetting in a teacher-student learning framework. We explore multiple task configurations:

\begin{itemize}
    \item \textbf{Standard Split}: Task 0 (digits 0--4) vs.\ Task 1 (digits 5--9), each with 5 classes
    \item \textbf{Variable Class Numbers}: Tasks whose number of classes is set by a parameter $C \in \{2, 3, 4, 5, 6, 7, 8, 9, 10\}$
    \item \textbf{Overlapping Tasks}: Both tasks contain all 10 classes, but in different orders, to study representation interference
\end{itemize}

Each image is flattened to a 784-dimensional vector and normalized to the range $[0, 1]$. The dataset is randomly split into training (60\%), validation (20\%), and test (20\%) sets using a fixed random seed for reproducibility. To vary task similarity in a controlled way, we optionally apply a partial permutation to the input features, governed by a parameter $\alpha \in [0, 1]$: $\alpha = 0$ leaves the pixels in place and $\alpha = 1$ permutes all pixel positions.
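
The text above fixes only the two endpoints of $\alpha$. As a minimal sketch, assuming that $\alpha$ is the fraction of pixel positions shuffled among themselves while all other positions stay fixed, a partial permutation could be generated as follows. The function names \texttt{partial\_permutation} and \texttt{apply\_permutation}, the rounding rule, and the \texttt{seed} argument are illustrative assumptions, not the implementation used in the experiments.

\begin{verbatim}
import random


def partial_permutation(n_features, alpha, seed=0):
    """Return an index list `perm` in which round(alpha * n_features)
    randomly chosen positions are shuffled among themselves and every
    other position maps to itself (assumed meaning of alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)            # fixed seed for reproducibility
    perm = list(range(n_features))       # start from the identity
    k = round(alpha * n_features)        # number of positions to shuffle
    chosen = rng.sample(range(n_features), k)
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    for dst, src in zip(chosen, shuffled):
        perm[dst] = src
    return perm


def apply_permutation(x, perm):
    """Reorder one flattened image so that output[i] = x[perm[i]]."""
    return [x[j] for j in perm]
\end{verbatim}

Because a shuffled position can land on itself by chance, at most $\mathrm{round}(784\,\alpha)$ pixels change place under this scheme.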

\subsection{Model Architectures}

We evaluate four neural network architectures with configurable bias terms and activation functions:

\begin{enumerate}
    \item \textbf{Linear Model}: Two stacked linear layers with no activation in between, so the overall map from input to output is linear (affine when biases are enabled):
    \begin{equation}
        f(x) = W_h (W_1 x + b_1) + b_h
    \end{equation}
    where $W_1 \in \mathbb{R}^{d \times 784}$ and $W_h \in \mathbb{R}^{C \times d}$, with $d$ the hidden dimension and $C$ the number of classes. The bias terms $b_1 \in \mathbb{R}^{d}$ and $b_h \in \mathbb{R}^{C}$ are optional.

    \item \textbf{SimpleMLP}: A two-layer network (one hidden layer) with a configurable activation:
    \begin{equation}
        f(x) = W_h \sigma(W_1 x + b_1) + b_h
    \end{equation}
    where $\sigma$ is the activation function (typically ReLU).

    \item \textbf{TwoLayerMLP}: A three-layer network (two hidden layers) with two activations:
    \begin{equation}
        f(x) = W_h \sigma(W_2 \sigma(W_1 x + b_1) + b_2) + b_h
    \end{equation}
    where $W_2 \in \mathbb{R}^{d \times d}$.

    \item \textbf{ThreeLayerMLP}: A four-layer network (three hidden layers) with three activations:
    \begin{equation}
        f(x) = W_h \sigma(W_3 \sigma(W_2 \sigma(W_1 x + b_1) + b_2) + b_3) + b_h
    \end{equation}
    where $W_3 \in \mathbb{R}^{d \times d_2}$ and $d_2$ is a secondary hidden dimension; for the shapes to compose, $W_2 \in \mathbb{R}^{d_2 \times d}$ in this model.
\end{enumerate}

All models support configurable activation functions and optional bias terms; most experiments use ReLU activations and no biases for analytical tractability.
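
To illustrate why the bias-free setting is convenient, consider the Linear Model without biases. The following is a short worked remark, not a result from the experiments, and $W_{\text{eff}}$ is notation introduced only here:
\begin{equation}
    f(x) = W_h W_1 x = W_{\text{eff}}\, x, \qquad W_{\text{eff}} = W_h W_1 \in \mathbb{R}^{C \times 784}, \qquad \operatorname{rank}(W_{\text{eff}}) \le \min(C, d).
\end{equation}
The network therefore computes a single linear map whose rank is limited by the narrower of the hidden and output layers, even though it is parameterized by two matrices.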

\subsection{Teacher-Student Framework}

Our experimental setup follows a sequential teacher-student learning paradigm designed to study catastrophic forgetting:

\subsubsection{Teacher Models}
We train two separate teacher models, each specialized on one task:
\begin{itemize}
    \item \textbf{Teacher 0}: Trained exclusively on Task 0 data (digits 0--4)
    \item \textbf{Teacher 1}: Trained exclusively on Task 1 data (digits 5--9)
\end{itemize}
Both teachers use the same architecture as specified above, with output dimensions corresponding to 5 classes each.

\subsubsection{Student Model}
The student model employs a dual-head architecture sharing a common feature extraction backbone:
\begin{equation}
    \begin{aligned}
        h(x) &= \text{backbone}(x) \\
        f_0(x) &= W_{h0} h(x) \\
        f_1(x) &= W_{h1} h(x)
    \end{aligned}
\end{equation}
where $W_{h0}, W_{h1} \in \mathbb{R}^{5 \times d}$ are the output heads for Task 0 and Task 1, respectively. During training, only one head is active at a time, determined by the current task.

\subsection{Initialization and Frozen Output Heads}

\subsubsection{Weight Initialization}
We employ multiple initialization strategies depending on the experimental focus:

\begin{itemize}
    \item \textbf{Kaiming Uniform}: For ReLU networks, weights are initialized as:
    \begin{equation}
        W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}}, \sqrt{\frac{6}{n_{\text{in}}}}\right)
    \end{equation}
    
    \item \textbf{Uniform Initialization}: For controlled initialization studies, weights are drawn from:
    \begin{equation}
        W_{ij} \sim \mathcal{U}(-a, a)
    \end{equation}
    where $a \in \{0.1, 0.5, 1.0, 5.0\}$ is the uniform bound parameter.
\end{itemize}
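
As a quick check on the scales involved, recall that $\mathcal{U}(-a, a)$ has variance $a^2 / 3$. For the Kaiming bound this gives
\begin{equation}
    \operatorname{Var}(W_{ij}) = \frac{1}{3} \cdot \frac{6}{n_{\text{in}}} = \frac{2}{n_{\text{in}}},
\end{equation}
which is the usual variance target for ReLU layers. For the first layer, $n_{\text{in}} = 784$ and the bound is $\sqrt{6/784} \approx 0.0875$, so even the smallest uniform bound considered, $a = 0.1$, is slightly larger than the Kaiming bound for that layer.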

All models (teachers and student) are initialized with the same random seed to ensure identical starting conditions. For student models, we investigate two head initialization strategies: identical initialization (\texttt{same\_head = True}) where both output heads start with identical weights, and independent initialization (\texttt{same\_head = False}).

\subsubsection{Frozen Output Heads}
A critical aspect of our setup is that \textbf{all output heads are frozen during training}. Specifically:
\begin{itemize}
    \item Teacher output layers $W_h$ have $\texttt{requires\_grad} = \texttt{False}$
    \item Student output heads $W_{h0}$ and $W_{h1}$ have $\texttt{requires\_grad} = \texttt{False}$
\end{itemize}
This constraint forces the networks to learn task-specific representations solely through modifications to the feature extraction layers, providing a controlled setting to study representation learning and interference.
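
The effect of a frozen head on the backbone can be made explicit with a standard calculation. As a sketch, assume a bias-free head $W_h$ acting on features $h$, softmax cross-entropy loss $\mathcal{L}$, and a one-hot target $y$ (the symbols $p$ and $y$ are introduced here for illustration):
\begin{equation}
    p = \mathrm{softmax}(W_h h), \qquad \frac{\partial \mathcal{L}}{\partial h} = W_h^\top (p - y).
\end{equation}
The error signal reaching the feature extractor is thus always projected through the same fixed matrix $W_h^\top$, while the head update $\partial \mathcal{L} / \partial W_h = (p - y)\, h^\top$ that would otherwise be applied is never used.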

\subsection{Training Protocol}

\subsubsection{Optimization}
All models are trained using Stochastic Gradient Descent (SGD) with the following hyperparameters:
\begin{itemize}
    \item Learning rate: $\eta = 0.01$
    \item Momentum: $\mu = 0.0$ (standard SGD)
    \item Batch size: 64
    \item Loss function: Cross-entropy loss
\end{itemize}

\subsubsection{Training Schedule}
The training follows a sequential protocol:
\begin{enumerate}
    \item \textbf{Teacher Training}: Each teacher is trained for 50 epochs on its respective task
    \item \textbf{Student Training}: The student is trained sequentially:
    \begin{itemize}
        \item 50 epochs on Task 0 (using Teacher 0's labels, head 0 active)
        \item 50 epochs on Task 1 (using Teacher 1's labels, head 1 active)
    \end{itemize}
\end{enumerate}

During student training, the model learns to mimic the teacher's predictions rather than the ground-truth labels, which amounts to a form of knowledge distillation in the sequential learning setting.
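
One way to write the student objective for task $t \in \{0, 1\}$ is given below. This is a sketch under two assumptions that the description leaves open: the teacher's labels are hard predictions obtained by an argmax over the teacher logits $g_t(x)$, and the loss is averaged over the mini-batch $B$. The symbols $g_t$, $\hat{y}_t$, $B$, and $\theta$ (the trainable backbone parameters) are introduced here for illustration.
\begin{equation}
    \begin{aligned}
        \hat{y}_t(x) &= \arg\max_{k} \, [g_t(x)]_k \\
        \mathcal{L}_t(\theta) &= -\frac{1}{|B|} \sum_{x \in B} \log \frac{\exp\left([f_t(x)]_{\hat{y}_t(x)}\right)}{\sum_{k} \exp\left([f_t(x)]_k\right)} \\
        \theta &\leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}_t(\theta)
    \end{aligned}
\end{equation}
Here $|B| = 64$ and $\eta = 0.01$ as listed in the optimization settings. A soft-label variant would replace the one-hot target implied by $\hat{y}_t(x)$ with the teacher's softmax output.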

\subsection{Experimental Design}

\subsubsection{Parameter Sweeps}
We conduct focused parameter sweeps to study multiple aspects of catastrophic forgetting:

\begin{itemize}
    \item \textbf{Hidden dimensions}: $d \in \{10, 50, 100, 200\}$, a small set that keeps compute manageable while spanning a range of model capacities
    \item \textbf{Random seeds}: $\{1, 2, 3, 4, 5\}$ to estimate run-to-run variability
    \item \textbf{Number of classes per task}: $C \in \{2, 3, 4, 5, 6, 7, 8, 9, 10\}$ to study task complexity effects
    \item \textbf{Initialization bounds}: $a \in \{0.1, 0.5, 1.0, 5.0\}$ for uniform initialization studies
    \item \textbf{Student head initialization}: $\{\text{same}, \text{different}\}$ to study representation sharing
    \item \textbf{Task similarity}: $\alpha \in [0, 1]$ via input permutation parameter
\end{itemize}

This multi-dimensional parameter space enables systematic analysis of how model capacity, initialization strategy, task complexity, and task similarity interact to influence catastrophic forgetting phenomena.
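
To make the size of this space concrete, the discrete values listed above can be enumerated as a Cartesian product. The sketch below assumes a full crossing of all factors, which the focused sweeps need not use, and omits $\alpha$ because only its range is specified. The names \texttt{build\_grid} and \texttt{config\_for\_index} are hypothetical.

\begin{verbatim}
import itertools

# Discrete sweep values as listed in the text; alpha is omitted because
# only its range [0, 1] is given, not a grid.
HIDDEN_DIMS = [10, 50, 100, 200]
SEEDS = [1, 2, 3, 4, 5]
NUM_CLASSES = [2, 3, 4, 5, 6, 7, 8, 9, 10]
INIT_BOUNDS = [0.1, 0.5, 1.0, 5.0]
SAME_HEAD = [True, False]


def build_grid():
    """Enumerate every combination (assumes a full Cartesian product)."""
    keys = ("d", "seed", "C", "a", "same_head")
    values = (HIDDEN_DIMS, SEEDS, NUM_CLASSES, INIT_BOUNDS, SAME_HEAD)
    return [dict(zip(keys, combo)) for combo in itertools.product(*values)]


GRID = build_grid()


def config_for_index(i):
    """Map a flat job index (e.g. an array-task index) to one config."""
    return GRID[i]
\end{verbatim}

Under that assumption the grid contains $4 \times 5 \times 9 \times 4 \times 2 = 1440$ configurations, and a flat index selects exactly one of them.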

\subsubsection{Evaluation Metrics}
For each configuration, we track:
\begin{itemize}
    \item Task-specific accuracy on validation sets after each training phase
    \item Cross-entropy loss for both heads throughout training
    \item Catastrophic forgetting metrics comparing performance on the first task before and after training on the second task
    \item Gradient accumulation patterns and sharpness measures based on Sharpness-Aware Minimization (SAM)~\cite{foret2020sharpness}
\end{itemize}
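
One common way to quantify forgetting, stated here as an assumption because the exact metric is not specified, is the drop in Task 0 accuracy between the end of the Task 0 phase and the end of the Task 1 phase. A minimal sketch with hypothetical helper names:

\begin{verbatim}
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def forgetting(acc_after_task0, acc_after_task1):
    """Drop in Task 0 accuracy caused by later training on Task 1.

    Both arguments are Task 0 accuracies, measured after the first
    and after the second training phase respectively."""
    return acc_after_task0 - acc_after_task1
\end{verbatim}

For example, a Task 0 accuracy of 1.0 after the first phase and 0.75 after the second would give a forgetting value of 0.25; these numbers are purely illustrative.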

All experiments are conducted using PyTorch Lightning~\cite{falcon2019pytorch} with deterministic training enabled for reproducibility. The experimental framework supports both sequential execution and parallel execution via SLURM job arrays for efficient large-scale parameter sweeps.
