# Grid Search Workflow

This note documents the perturbation experiment implemented in `experiments/perturbation/grid_search.py` and spells out the key formulas and assumptions.

## Notation

- Linear layers are indexed by \(\mathcal{L}\). Each layer \(\ell \in \mathcal{L}\) has weight matrix \(W_\ell \in \mathbb{R}^{d_{\text{out}}^{(\ell)} \times d_{\text{in}}^{(\ell)}}\).
- Ranks to evaluate form a set \(\mathcal{R}\); each perturbation uses some \(r \in \mathcal{R}\).
- Trials are indexed by \(t = 0, \ldots, N_{\text{trials}}-1\).
- A target Frobenius norm \(T_{t,r}\) is either supplied explicitly via `--target-norm` or sampled uniformly from a configurable interval \([T_{\min}, T_{\max}]\), independently for each run \((t, r)\).
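- The evaluation loss (mean log-perplexity) for model parameters \(\theta\) is denoted \(\mathcal{L}(\theta)\); the corresponding perplexity is \(\mathrm{PPL}(\theta) = e^{\mathcal{L}(\theta)}\). The symbol \(\mathcal{L}\) is overloaded: on its own it is the set of layer indices, while with an argument or subscript (\(\mathcal{L}(\theta)\), \(\mathcal{L}_0\), \(\mathcal{L}_{t,r}\)) it is the loss.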

## Baseline pass

The script loads the dataset, tokenizer, and model, then computes a baseline loss and perplexity:

\[
\mathcal{L}_0 = \mathcal{L}(\theta_0), \qquad \mathrm{PPL}_0 = e^{\mathcal{L}_0}.
\]

The unperturbed parameters \(\theta_0\) are cached so the model can be restored before each perturbation.
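
As a minimal sketch of the baseline pass, the snippet below turns a mean per-token loss into a perplexity and shows one way to snapshot and restore parameters. The `params` dictionary, the helper names, and the per-token losses are hypothetical stand-ins; the actual script presumably operates on a real model's tensors rather than Python lists.

```python
import copy
import math


def mean_loss(token_nlls):
    """Mean negative log-likelihood per token (natural log)."""
    return sum(token_nlls) / len(token_nlls)


def perplexity(loss):
    """PPL = exp(loss), matching the definition in the Notation section."""
    return math.exp(loss)


# Hypothetical stand-in for the model's parameters: layer name -> weight rows.
params = {"layer0": [[1.0, 2.0], [3.0, 4.0]]}

# Cache theta_0 once; deepcopy so later in-place edits cannot leak into it.
theta_0 = copy.deepcopy(params)

# Simulate a perturbation, then restore the baseline from the cache.
params["layer0"][0][0] += 0.5
params = copy.deepcopy(theta_0)

baseline_loss = mean_loss([2.0, 3.0, 4.0])  # made-up per-token losses
baseline_ppl = perplexity(baseline_loss)
```

Restoring from a deep copy (rather than subtracting the update again) avoids accumulating floating-point drift across runs.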

## Sampling nested noise bases

For each trial \(t\), a single random seed drives *nested* Gaussian bases up to the maximum rank \(R_{\max} = \max \mathcal{R}\):

\[
A_\ell \sim \mathcal{N}(0, 1)^{d_{\text{out}}^{(\ell)} \times R_{\max}},
\qquad
B_\ell \sim \mathcal{N}(0, 1)^{R_{\max} \times d_{\text{in}}^{(\ell)}}.
\]

These matrices live on the CPU in float32. Because smaller ranks reuse the leading columns of \(A_\ell\) and the leading rows of \(B_\ell\), the perturbations are nested: for rank \(r\), the layerwise raw update is

\[
\Delta^{(r)}_\ell = A_\ell[:, 0{:}r] \; B_\ell[0{:}r, :].
\]

Sharing bases within a trial ensures that different ranks explore consistent directions; only the truncation depth changes.
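
The nesting property can be illustrated with a small standard-library sketch. The function names `sample_bases` and `raw_update` are hypothetical, and plain Python lists stand in for the float32 matrices; the point is only that moving from rank \(r\) to \(r+1\) adds a single outer product.

```python
import random


def sample_bases(d_out, d_in, r_max, seed):
    """Draw A (d_out x r_max) and B (r_max x d_in) with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)  # one seed per trial drives both matrices
    a = [[rng.gauss(0.0, 1.0) for _ in range(r_max)] for _ in range(d_out)]
    b = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(r_max)]
    return a, b


def raw_update(a, b, r):
    """Delta^(r) = A[:, 0:r] @ B[0:r, :], using only the leading r columns/rows."""
    d_out, d_in = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(r)) for j in range(d_in)]
        for i in range(d_out)
    ]


a, b = sample_bases(d_out=3, d_in=4, r_max=4, seed=0)
delta_1 = raw_update(a, b, 1)
delta_2 = raw_update(a, b, 2)

# Nesting: going from rank 1 to rank 2 adds exactly one outer product,
# column 1 of A times row 1 of B.
increment = [[delta_2[i][j] - delta_1[i][j] for j in range(4)] for i in range(3)]
outer = [[a[i][1] * b[1][j] for j in range(4)] for i in range(3)]
```

Here `increment` and `outer` agree up to floating-point rounding, which is what "only the truncation depth changes" means in practice.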

## Targeted Frobenius rescaling

Before modifying the model, the script forms the global raw perturbation magnitude for \((t, r)\):

\[
\|\Delta^{(r)}\|_F = \sqrt{\sum_{\ell \in \mathcal{L}} \left\| \Delta^{(r)}_\ell \right\|_F^2 }.
\]

If the user has not provided a fixed norm, a target \(T_{t,r}\) is sampled uniformly from \([T_{\min}, T_{\max}]\) (defaults: `[0, 1000]`). The actual update scales every layer by a shared factor

\[
\alpha_{t,r} =
\begin{cases}
\dfrac{T_{t,r}}{\|\Delta^{(r)}\|_F}, & \text{if } \|\Delta^{(r)}\|_F > 0, \\
0, & \text{otherwise},
\end{cases}
\qquad
W_\ell \leftarrow W_\ell + \alpha_{t,r} \; \Delta^{(r)}_\ell.
\]

Hence, whenever \(\|\Delta^{(r)}\|_F > 0\), the achieved perturbation magnitude \(\|\alpha_{t,r} \Delta^{(r)}\|_F\) equals \(T_{t,r}\) up to numerical error; if the raw norm is zero, no perturbation is applied.
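
A worked example of the rescaling, under the assumption that raw updates are held in a dictionary keyed by layer name (the names `fc1`, `fc2` and all helper functions are hypothetical). The numbers are chosen so the global norm is exactly 5.

```python
import math


def global_frobenius_norm(updates):
    """sqrt of the sum of squared entries over every layer's raw update."""
    total = sum(x * x for delta in updates.values() for row in delta for x in row)
    return math.sqrt(total)


def scale_factor(target_norm, raw_norm):
    """alpha = T / ||Delta||_F, or 0 when the raw update is exactly zero."""
    return target_norm / raw_norm if raw_norm > 0 else 0.0


def apply_update(weights, updates, alpha):
    """Return new weights W + alpha * Delta for every layer (inputs untouched)."""
    return {
        name: [
            [w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights[name], updates[name])
        ]
        for name in weights
    }


# Hypothetical two-layer example with hand-picked raw updates.
weights = {"fc1": [[1.0, 0.0], [0.0, 1.0]], "fc2": [[2.0, 2.0]]}
updates = {"fc1": [[3.0, 0.0], [0.0, 0.0]], "fc2": [[0.0, 4.0]]}

raw_norm = global_frobenius_norm(updates)                  # sqrt(9 + 16) = 5
alpha = scale_factor(target_norm=10.0, raw_norm=raw_norm)  # 10 / 5 = 2
perturbed = apply_update(weights, updates, alpha)
```

Because \(\alpha_{t,r}\) is shared, layers with larger raw updates receive a proportionally larger share of the total norm budget; the target constrains only the global norm, not each layer's.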

## Evaluation metrics per run

After applying the perturbation, the script recomputes loss and perplexity:

\[
\mathcal{L}_{t,r} = \mathcal{L}(\theta_0 + \alpha_{t,r} \Delta^{(r)}),
\qquad
\mathrm{PPL}_{t,r} = e^{\mathcal{L}_{t,r}}.
\]

The recorded deltas are

\[
\Delta \mathcal{L}_{t,r} = \mathcal{L}_{t,r} - \mathcal{L}_0,
\qquad
\Delta \mathrm{PPL}_{t,r} = \mathrm{PPL}_{t,r} - \mathrm{PPL}_0.
\]

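To approximate sensitivity at each rank, the script also logs the loss change per unit of perturbation norm. This is a secant (finite-difference) slope rather than a true derivative, so it approximates the first-order slope only for small perturbations:

\[
S_{t,r} = \frac{\Delta \mathcal{L}_{t,r}}{\|\alpha_{t,r} \Delta^{(r)}\|_F} \approx \frac{\mathcal{L}_{t,r} - \mathcal{L}_0}{T_{t,r}}.
\]

The ratio is undefined when the achieved norm is zero (for example, when \(T_{t,r} = 0\)).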
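
The per-run metrics can be sketched as a single helper. The function name, the dictionary keys, and the choice to return `None` for a zero-size perturbation are assumptions of this sketch, not necessarily what the script does.

```python
import math


def run_metrics(loss, baseline_loss, achieved_norm):
    """Loss/perplexity deltas and the slope S for one (trial, rank) run."""
    delta_loss = loss - baseline_loss
    delta_ppl = math.exp(loss) - math.exp(baseline_loss)
    # The slope divides by the perturbation size, so it has no value when
    # that size is zero; returning None here is an assumption of this sketch.
    sensitivity = delta_loss / achieved_norm if achieved_norm > 0 else None
    return {
        "delta_loss": delta_loss,
        "delta_ppl": delta_ppl,
        "sensitivity": sensitivity,
    }


# Made-up numbers: baseline loss 3.0, perturbed loss 3.5, achieved norm 10.
metrics = run_metrics(loss=3.5, baseline_loss=3.0, achieved_norm=10.0)
```

Note that \(\Delta \mathrm{PPL}\) grows exponentially in \(\Delta \mathcal{L}\), so the loss delta is usually the better-behaved quantity to compare across runs.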

All per-run statistics (including the applied rank, sampled target norm, achieved norm, scale factor \(\alpha_{t,r}\), and sensitivity) are appended to a JSON file for later plotting.
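
One plausible way to append records to a JSON file is to keep a top-level array and rewrite it on each append. This is only a sketch: the on-disk layout, the `append_record` helper, the field names, and the example values are all hypothetical and may differ from the script's actual schema.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one run's statistics to a JSON array stored at `path`."""
    records = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            records = json.load(fh)
    records.append(record)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
    return len(records)


# Field names and values below are illustrative, not the script's schema.
example = {
    "trial": 0,
    "rank": 4,
    "target_norm": 10.0,
    "achieved_norm": 10.0,
    "alpha": 2.0,
    "sensitivity": 0.05,
}

with tempfile.TemporaryDirectory() as tmp:
    results_path = os.path.join(tmp, "results.json")
    append_record(results_path, example)
    count = append_record(results_path, dict(example, rank=8))
    with open(results_path, "r", encoding="utf-8") as fh:
        stored = json.load(fh)
```

Rewriting the whole array keeps the file valid JSON after every run, at the cost of re-reading it each time; for very long sweeps a line-delimited format would scale better.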

## Loop structure summary

For each trial \(t\):

1. Restore the baseline parameters \(\theta_0\).
2. Sample \(\{A_\ell, B_\ell\}_{\ell \in \mathcal{L}}\) once at \(R_{\max}\).
3. For every \(r \in \mathcal{R}\):
   1. Restore \(\theta_0\).
   2. Determine \(T_{t,r}\) (either user-provided or sampled from the configured range).
   3. Form the nested update \(\Delta^{(r)}\) using the pre-sampled bases.
   4. Scale by \(\alpha_{t,r} = T_{t,r} / \|\Delta^{(r)}\|_F\) (or \(0\) if the raw norm is zero) and apply to all linear layers.
   5. Measure \(\mathcal{L}_{t,r}\), \(\mathrm{PPL}_{t,r}\), and record the derived metrics.

Because noise bases are shared within a trial, perturbations at different ranks explore aligned directions; the only variation arises from the truncation depth and the sampled target magnitude.
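
The loop above can be condensed into a self-contained sketch. It is not the script itself: `toy_loss` is a hypothetical stand-in for the real evaluation, weights are plain nested lists, the target norm is fixed (as with `--target-norm`), ranks are visited in ascending order by assumption, and the per-trial seed `base_seed + t` is an illustrative choice.

```python
import copy
import math
import random


def toy_loss(weights):
    """Hypothetical stand-in for the evaluation loss: mean squared weight."""
    values = [x for w in weights.values() for row in w for x in row]
    return sum(x * x for x in values) / len(values)


def grid_search(theta_0, ranks, n_trials, target_norm, base_seed=0):
    baseline = toy_loss(theta_0)
    r_max = max(ranks)
    results = []
    for t in range(n_trials):
        rng = random.Random(base_seed + t)  # one seed per trial
        # Sample A_l and B_l once per trial at R_max for every layer.
        bases = {}
        for name, w in theta_0.items():
            d_out, d_in = len(w), len(w[0])
            a = [[rng.gauss(0.0, 1.0) for _ in range(r_max)] for _ in range(d_out)]
            b = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(r_max)]
            bases[name] = (a, b)
        for r in sorted(ranks):
            weights = copy.deepcopy(theta_0)  # restore theta_0
            # Nested raw update: leading r columns of A times leading r rows of B.
            deltas = {
                name: [[sum(a[i][k] * b[k][j] for k in range(r))
                        for j in range(len(b[0]))] for i in range(len(a))]
                for name, (a, b) in bases.items()
            }
            raw = math.sqrt(sum(x * x for d in deltas.values()
                                for row in d for x in row))
            alpha = target_norm / raw if raw > 0 else 0.0
            for name, d in deltas.items():
                for i, row in enumerate(d):
                    for j, x in enumerate(row):
                        weights[name][i][j] += alpha * x
            loss = toy_loss(weights)
            results.append({"trial": t, "rank": r, "alpha": alpha,
                            "achieved_norm": alpha * raw,
                            "delta_loss": loss - baseline})
    return results


theta_0 = {"fc1": [[1.0, 0.0], [0.0, 1.0]], "fc2": [[0.5, -0.5]]}
results = grid_search(theta_0, ranks={1, 2}, n_trials=2, target_norm=0.1)
```

Each entry of `results` corresponds to one \((t, r)\) pair, and every run starts from an untouched copy of `theta_0`.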
