\section{Experimental Setup}

We study catastrophic forgetting in a teacher-student learning framework using the MNIST dataset, partitioned into two tasks: Task 0 (digits 0--4) and Task 1 (digits 5--9). We evaluate four network architectures, all without bias terms: a Linear model, SimpleMLP (one hidden layer), TwoLayerMLP, and ThreeLayerMLP. The MLPs use ReLU activations.

We first train two specialized teachers, one per task, and then train a dual-head student sequentially. The student shares a common feature-extraction backbone across tasks and has one output head per task: $f_i(x) = W_{hi} \cdot \text{backbone}(x)$ for $i \in \{0,1\}$, where $W_{hi}$ is the weight matrix of head $i$. A critical constraint is that \textbf{all output heads are frozen during training} (\texttt{requires\_grad = False}), so the network can adapt to each task only by modifying the feature layers of the backbone.
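
To make the frozen-head constraint concrete, the following is a sketch of the student in the one-hidden-layer (SimpleMLP) case. The symbols $W_1$, $\phi$, and $\theta$ are introduced here for illustration and do not come from the setup above; the shapes assume flattened $28 \times 28$ MNIST inputs (784 features), hidden width $d$, and five output classes per head.
\begin{align*}
\text{backbone}(x) &= \phi(W_1 x), & \phi(z) &= \max(z, 0) \ \text{(elementwise)}, \\
f_i(x) &= W_{hi}\, \phi(W_1 x), & i &\in \{0, 1\},
\end{align*}
where $W_1$ has shape $d \times 784$ and each $W_{hi}$ has shape $5 \times d$. Under this reading the trainable parameters are $\theta = \{W_1\}$ only: $W_{h0}$ and $W_{h1}$ stay at their initial values, so any change in $f_0$ caused by training on Task 1 must come through $W_1$.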

Within each run, all models are initialized with Kaiming uniform initialization using the same random seed. We train with SGD (learning rate $\eta = 0.01$, batch size 64) and a cross-entropy loss. Each teacher trains for 50 epochs on its own task. The student then trains sequentially: 50 epochs on Task 0, mimicking Teacher 0, followed by 50 epochs on Task 1, mimicking Teacher 1.
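
The sequential schedule can be summarized as a single update rule. This is a sketch under stated assumptions, not a definitive description of the training code: $\theta$ denotes the trainable backbone parameters, $T_i$ denotes Teacher $i$, $B$ is a mini-batch of 64 inputs drawn from Task $i$, and $\ell$ is the cross-entropy between the student's output and the teacher's target. Whether that target is the teacher's soft output or its predicted label is not specified here.
\begin{equation*}
\theta \leftarrow \theta - \eta \, \nabla_{\theta} \, \frac{1}{|B|} \sum_{x \in B} \ell\bigl(f_i(x; \theta),\, T_i(x)\bigr),
\qquad
i =
\begin{cases}
0 & \text{during the first 50 student epochs}, \\
1 & \text{during the last 50 student epochs},
\end{cases}
\end{equation*}
with $\eta = 0.01$. Because the output heads are frozen, the gradient is taken with respect to the backbone parameters only.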

We sweep over hidden dimensions $d \in \{10, 20, \ldots, 2000\}$ and 10 random seeds, yielding 200 configurations per architecture. This lets us analyze how catastrophic forgetting varies with model capacity and initialization.
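
As a check on the sweep size, assume the sweep is a full Cartesian product of a hidden-dimension set $\mathcal{D}$ and a seed set $\mathcal{S}$ (both symbols are introduced here for illustration). The number of configurations per architecture is then
\begin{equation*}
N = |\mathcal{D}| \cdot |\mathcal{S}|,
\qquad
\text{so } N = 200 \text{ and } |\mathcal{S}| = 10 \text{ give } |\mathcal{D}| = \frac{200}{10} = 20,
\end{equation*}
that is, 20 distinct hidden dimensions between 10 and 2000 under this assumption.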
