# Comprehensive Test Plan v3.0
*Generated: April 2026 | Covers all C1, C3, C4, C5 experiments across all models and term sets*

---

## Pre-Read: Clarifications on Desired Scope

### 1. "57 terms vs 45 unique terms"

**57 is a ghost number.** It is the gross row count if you sum all dataset files without deduplication:

| File | Terms | New unique | Overlap |
|---|---|---|---|
| `pilot_terms.jsonl` | 3 | 3 | 0 |
| `expanded_terms.jsonl` | 6 | 6 | 0 |
| `expanded_terms_100.jsonl` | 9 | 2 | 7 reused from above |
| `expanded_terms_fewshot.jsonl` | 6 | 2 | 4 reused from Set A |
| `expanded_terms_tier123.jsonl` | 21 | 20 | 1 (`keyboard navigation`) |
| `expanded_terms_wave2.jsonl` | 12 | 12 | 0 |
| **Total** | **57 slots** | **45 unique** | **12 wasted** |

The **canonical vocabulary is 45 unique terms**. The plan uses this number exclusively.

### 2. What V1/V2/V3/V4 means in this plan

Two usages — both defined here to avoid ambiguity:

**Term-set versions** (scope of each experiment):
| Label | N | Terms | Status |
|---|---|---|---|
| **T-V1** | 3 | Pilot: screen reader, skip link, alt text | All prompts exist ✅ |
| **T-V2** | 9 | Set B: T-V1 + aria attribute, color contrast, focus indicator, heading structure, keyboard navigation, landmark region | All prompts exist ✅ |
| **T-V3** | 21 | Tier-1/2/3: 20 new terms + keyboard navigation | All prompts exist ✅ |
| **T-V4** | 12 | Wave-2: contrast ratio, eye tracking, time limits, reduced motion, focus trap, sign language, touch target size, haptic feedback, plain language, motion sensitivity, semantic html, orientation support | All prompts exist ✅ |

> **Note on T-V2A (Set A):** The 9-term C1/C4 lifecycle uses tab order + form validation instead of keyboard navigation + landmark region. This is a legacy split documented in `paper/sections/methods.md §3.2`.

**Ablation conditions** within C5 (the V1–V4 the user likely also means):
| Label | Condition |
|---|---|
| **A-V1** | Top-k binding heads ablated (main effect) |
| **A-V2** | Random k heads ablated ×5 draws (null baseline) |
| **A-V3** | Bottom-k binding heads ablated (discriminant control) |
| **A-V4** | No ablation (baseline performance; zero intervention) |

---

## Review: What to Expand, What Not to Do

### ✅ Expand

| Item | Rationale |
|---|---|
| **C5 scope → T-V3 + T-V4 (33 terms, N=165 rec)** | Larger N improves specificity reliability. More terms = tighter CI on the Δ(top − random) estimate. T-V4 (wave-2) covers new categories (Mobile, Vestibular, Sensory) not tested in any prior C5 run. |
| **C1/C4 on Stanford CRFM GPT-2 (5 seeds)** | Enables proper significance testing across seeds — the only way to distinguish true lifecycle signal from random variation. First reproducibility test of the lifecycle claim. |
| **C1/C4 on SmolLM3-3B** | First multilingual model; if lifecycle signal appears in multilingual data, the claim has substantially stronger generalizability. |
| **C5 on Qwen2.5-1.5B** | Answers the reviewer question "did you test modern models?" Causal ablation at a single checkpoint is sufficient. TL native support = zero infra cost. |
| **Canonical term unification** | All models must use the same T-V2/T-V3/T-V4 sets. Currently Set A ≠ Set B — this created comparison problems. Going forward: all models use T-V2 (Set B) for C1/C4 too (after tokenization audit confirms validity). |

### ❌ Do Not Expand / Restrictions

| Item | Rationale |
|---|---|
| **C1/C4 between-term Spearman on T-V3 + T-V4** | The *original* cross-sectional test requires between-term EB* variance. Tier123 terms cluster at 0.60–0.77 — no spread, Spearman collapses. This is not fixed by adding more uniform terms. However, **C1/C4 are redesigned (not abandoned)** — see Phase 1 §C1-B / Phase 3 §C4-B for the within-term temporal precedence test that generalises to all 45 terms without requiring between-term spread. |
| **C3 on T-V3 + T-V4 (beyond 9 terms)** | Few-shot exemplar prompts do not exist for the 36 terms beyond T-V2. The few-shot format requires carefully authored 2–3 examples per term that actually demonstrate understanding — the template-generated recognition/generation prompts in tier123/wave2 files are not suitable as few-shot examples. Expand C3 only after authoring exemplars. |
| **Full lifecycle on Qwen2.5-1.5B** | No intermediate checkpoints. Structurally impossible. Only C3 + C5 are viable. Do not attempt workarounds. |
| **C5 on LLM360 AMBER-7B (7B)** | Requires ~14GB VRAM minimum. GPU budget on Lightning Studio is tight. Deprioritize unless compute scales up. |
| **Treating Set A and Set B as equivalent in cross-model comparison** | They differ in 2 terms. Cross-model C1/C4 comparisons involving these two sets are **approximate only** and must be explicitly flagged. The plan standardizes on Set B (T-V2) going forward. |
| **Running all 154 Pythia checkpoints** | Strategic 8-checkpoint selection (step 0, 15k, 30k, 60k, 90k, 120k, 140k, 143k) already captures the full lifecycle curve. More checkpoints add computational cost with diminishing analytical return. |

---

## Phase 0: Infrastructure (Blockers — Must Complete First)

### 0.1 Cross-Tokenizer Audit for All 45 Terms

Run tokenization audit for **all 45 terms** across all model tokenizers to confirm multi-token validity before committing them to experiments. Terms that collapse to 1 token in any target model's tokenizer must be flagged and excluded from that model's test.

| Model | Tokenizer | Status |
|---|---|---|
| Pythia (all sizes) | GPT-NeoX | ✅ Previously audited for T-V2 terms |
| OLMo-1B | Dolma BPE | ✅ Previously audited for T-V2 terms |
| CRFM GPT-2 Small | GPT-2 | ❌ Pending — run for T-V3 + T-V4 |
| SmolLM3-3B | LLaMA-3 BPE | ❌ Pending — run for all 45 |
| Qwen2.5-1.5B | Qwen tiktoken | ❌ Pending — run for all 45 |

**Action:** Extend `src/tokenization_audit.py` to accept an arbitrary term list and model name.

### 0.2 Checkpoint Alignment Table

Map equivalent "lifecycle stages" across models so that checkpoint comparisons are conceptually valid. Absolute step numbers are not comparable — use token count as the universal unit.

| Stage | Pythia (tokens) | OLMo (tokens) | CRFM GPT-2 (tokens) | SmolLM3 (tokens) |
|---|---|---|---|---|
| Init | 0 | 0 | 0 | 0 |
| Early | ~31B (step 15k) | ~7B (step 15k) | ~209M (step 100) | ~94B (step 40k) |
| Mid | ~250B (step 120k) | ~50B (step ~100k) | ~4B (step 2k) | ~1.9T (step 800k) |
| Late | ~300B (step 143k) | ~60B (step ~120k) | ~4.2B (step 2k+) | ~2.4T (step 1M+) |

**Action:** Finalize the token-count alignment and select 8 checkpoints per model at matched token milestones.

### 0.3 Unified Prompt Register

Create `data/prompts/canonical_45terms.jsonl` — a single deduplicated file containing all prompts for all 45 terms, sourced from the existing files. This becomes the single source of truth for all experiments.

```
pilot_terms.jsonl         →  3 terms
expanded_terms_100.jsonl  →  9 terms  (Set B; replaces expanded_terms.jsonl)
expanded_terms_tier123.jsonl → 21 terms
expanded_terms_wave2.jsonl   → 12 terms
Total: 45 unique terms, ~11 prompts/term = ~495 prompts
```

> `ARIA label` and `semantic HTML` from `expanded_terms_fewshot.jsonl` are C3-example-only terms (not full prompt sets) — exclude from the canonical C5/C1 prompt file but retain in the fewshot context.

### 0.4 Author C3 Few-Shot Exemplars for T-V3/T-V4 (Optional, Separate Track)

If C3 expansion beyond T-V2 is desired, 2 high-quality few-shot exemplars must be hand-authored for each of the 33 new terms (T-V3 + T-V4). This is a content-authoring task, not a coding task. Estimated effort: 4–6 hours. **Do not block main experiments on this.**

---

## Phase 1: C1 — Binding-Behavior Lead-Lag Correlation

**Claim:** EB* rise during training temporally precedes the emergence of behavioral competence. This is validated at two levels: (A) cross-term ranking on a diverse 9-term pilot set, and (B) within-term temporal precedence generalised to all 45 terms.

---

### C1-A: Between-Term Spearman (Original, 9 Terms)

**Design:** At each checkpoint, rank 9 terms by EB* and rank by behavioral score. Compute Spearman ρ across terms. Requires between-term EB* variance — held for the 9-term pilot set.

**Scope:** T-V2A (9 terms, Set A) for Pythia; T-V2 (9 terms, Set B) for all new models. 8 checkpoints.

| Model | Has CK | Seeds | Term set | Runs |
|---|---|---|---|---|
| Pythia-160M | ✅ | 1 | T-V2A (9) | 9 × 8 × 2 = **144** |
| Pythia-1B | ✅ | 1 | T-V2A (9) | **144** |
| Pythia-2.8B | ✅ | 1 | T-V2A (9) | **144** |
| OLMo-1B | ✅ | 1 | T-V2 (9) | **144** |
| CRFM GPT-2 Small | ✅ | **5** | T-V2 (9) | 9 × 8 × 5 × 2 = **720** |
| SmolLM3-3B | ✅ | 1 | T-V2 (9) | **144** |
| Qwen2.5-1.5B | ❌ | — | — | **SKIP** |
| **C1-A TOTAL** | | | | **1,440 runs** |

**Protocol:**
1. Extract EB* at 8 checkpoints per term per model
2. Evaluate Beh (RecAcc + GenScore) at 8 checkpoints per term per model
3. At each checkpoint k: Spearman(EB* ranks, Beh ranks) across 9 terms → ρ_k
4. Plot ρ_k across training; early positive = C1 supported

**Outputs:**
- `data/results/binding/{model}_{seed}_step{N}_binding.json`
- `data/results/behavioral/{model}_{seed}_step{N}_behavioral.json`

---

### C1-B: Within-Term Temporal Precedence (Redesigned, All 45 Terms)

**Design:** For each term *t* individually, test whether EB*(t) at checkpoint k predicts Beh(t) at checkpoint k+1 better than Beh(t) at k predicts EB*(t) at k+1. This is a per-term lead indicator test — no between-term variance required.

**Scope:** All 45 terms, same 8 checkpoints, same models. Binding and behavioral data is **shared with C1-A** — the only additional cost is extracting the 36 new terms (T-V3 + T-V4) across 8 checkpoints (already needed partially for C5).

| Model | Has CK | New runs (36 new terms × 8 ck × 2) |
|---|---|---|
| Pythia-160M | ✅ | **576** |
| Pythia-1B | ✅ | **576** |
| Pythia-2.8B | ✅ | **576** |
| OLMo-1B | ✅ | **576** |
| CRFM GPT-2 Small | ✅ | 576 × 5 seeds = **2,880** |
| SmolLM3-3B | ✅ | **576** |
| **C1-B additional TOTAL** | | **5,760 runs** |

> C1-A runs (9 terms × 8 ck) are a subset of C1-B data — no duplication. Total unique lifecycle runs across C1-A + C1-B = 45 × 8 × 2 per model.

**Protocol:**
1. For each term t across 8 checkpoints: compute forward lag correlation
   - `r_forward(t)` = Pearson(EB*(t, ck_0:6), Beh(t, ck_1:7))  — EB* leads Beh by 1 step
   - `r_backward(t)` = Pearson(Beh(t, ck_0:6), EB*(t, ck_1:7)) — Beh leads EB* by 1 step
   - `lead_indicator(t)` = 1 if r_forward > r_backward, else 0
2. Population test: binomial test on {lead_indicator(t)} across all 45 terms
   - H1: P(lead) > 0.5 (EB* more often leads than lags)
   - With N=45: 60% positive → z ≈ 1.34; 70% → z ≈ 2.68 (p < 0.01)
3. Report: % of terms where EB* leads Beh, mean r_forward − r_backward, per model

**Interpretation:**
- If ≥60% of 45 terms show EB* leading: C1 generalises beyond the pilot term set
- If <50%: lifecycle may be term-selective — characterise which term properties predict lead direction
- Either result is publishable (see Paper Narrative section)

**Outputs:**
- `data/results/c1b/{model}_within_term_lead.csv` — one row per term: r_forward, r_backward, lead_indicator
- `data/results/c1b/{model}_population_test.json` — binomial test result, mean lead diff

---

## Phase 2: C3 — Few-Shot Unlockability

**Claim:** Few-shot examples unlock latent accessibility knowledge that zero-shot probing cannot access, especially at early checkpoints. This gap narrows at late checkpoints as behavioral competence emerges.

### Scope
- **Term set:** T-V2 (9 terms, Set B) — few-shot exemplars exist for these
- **Checkpoints:** 2 per model — early (≈step 15k or equivalent) + late (final or ≈step 143k)
- **Conditions:** Zero-shot (baseline) + few-shot (2-example prompt)

### Models and Runs

| Model | Has CK | Seeds | Runs (zero+few × 2 ck × 9 terms) |
|---|---|---|---|
| Pythia-160M | ✅ | 1 | **36** |
| Pythia-1B | ✅ | 1 | **36** |
| Pythia-2.8B | ✅ | 1 | **36** |
| OLMo-1B | ✅ | 1 | **36** |
| CRFM GPT-2 Small | ✅ | **5** | **180** |
| SmolLM3-3B | ✅ | 1 | **36** |
| Qwen2.5-1.5B | ❌ | 1 | 36 (final checkpoint only for both "early"/"late" — compare across sizes instead) |
| **TOTAL** | | | **396 runs** |

### Protocol
1. Run `eval_few_shot_c3.py` (already built) for each model × checkpoint × condition
2. Δfew-shot = few-shot score − zero-shot score per term per checkpoint
3. Expected: Δ is large at early checkpoint, shrinks at late checkpoint (knowledge becomes self-sufficient)

### Outputs
- `data/results/c3/crfm_gpt2_sm_{seed}_step{N}_c3.json`
- `data/results/c3/smollm3_step{N}_c3.json`
- `data/results/c3/qwen25_1.5b_final_c3.json`

---

## Phase 3: C4 — Binding-Behavior Decoupling

**Claim:** The coupling between EB* and behavioral competence weakens or reverses at late training checkpoints, indicating representations have distributed beyond binding heads.

**No new data collection.** C4 reuses all binding + behavioral data from Phase 1. This phase is analysis only.

---

### C4-A: Between-Term Spearman Sign Flip (Original, 9 Terms)

- Split lifecycle data into early window (ck0–ck3) and late window (ck4–ck7)
- Report: early ρ (coupling), late ρ (decoupling), sign change per model
- CRFM GPT-2 Small: report mean ± SD across 5 seeds
- Expected: small models (160M) show persistent coupling; large models (1B, 2.8B, SmolLM3-3B) show late decoupling

---

### C4-B: Within-Term Decoupling (Redesigned, All 45 Terms)

- For each term t, compute within-term Spearman correlation in early window and late window independently
  - `rho_early(t)` = Spearman(EB*(t, ck0:3), Beh(t, ck0:3))  [4 points]
  - `rho_late(t)`  = Spearman(EB*(t, ck4:7), Beh(t, ck4:7))  [4 points]
  - `decouple(t)`  = 1 if rho_early > 0 AND rho_late ≤ 0
- Report: fraction of 45 terms showing decoupling per model
- Expected: decoupling fraction is higher in large models than small models
- This test directly captures the **model-scale dependence** of decoupling without relying on between-term variance

**Outputs:**
- `data/results/c4b/{model}_within_term_decoupling.csv` — one row per term: rho_early, rho_late, decouple flag
- Summary table for `paper/sections/results.md §4.4`

---

## Phase 4: C5 — Causal Ablation

**Claim:** Ablating top-binding heads causes a larger performance drop than ablating random or bottom-binding heads (specificity > 0), confirming binding heads are causally necessary for accessibility concept representation.

### Scope
- **Term set:** T-V2 (9 Set B) + T-V3 (21 tier123, 20 new) + T-V4 (12 wave-2) = **41 terms, N=205 recognition prompts**
- **Critical design principle:** T-V2 (the 9 C1/C4/C3 terms) is explicitly included in C5. This gives a 9-term coherent core where all four experiments are run on the same terms, enabling a complete per-term mechanistic narrative. The additional 32 terms extend statistical power for the specificity estimate.
- **Why not T-V1 separately:** T-V1 pilot terms are already inside T-V2 (Set B), so they are covered.
- **N=205 rec prompts** reduces per-prompt noise to ±0.70 pp.
- **Checkpoint:** Late trained (final checkpoint or equivalent)

### Ablation Conditions (A-V1 through A-V4)

| Condition | Description | Purpose |
|---|---|---|
| **A-V4** | No ablation (baseline) | Establishes baseline performance |
| **A-V1** | Top-k binding heads ablated | Main effect — tests causal necessity |
| **A-V2** | Random k heads ablated ×5 draws | Null baseline — controls for general head ablation effect |
| **A-V3** | Bottom-k binding heads ablated | Discriminant control — confirms effect is binding-specific, not any-head |

> **Specificity** = Δ(A-V4 → A-V1) − mean Δ(A-V4 → A-V2). Positive = binding heads causally specific.

### Models and Runs (forward passes)

| Model | TL Support | Seeds | (33+3)×5 prompts × 7 conditions | Total |
|---|---|---|---|---|
| Pythia-160M | ✅ native | 1 | 180 × 7 | **1,260** |
| Pythia-1B | ✅ native | 1 | **1,260** |
| Pythia-2.8B | ✅ native | 1 | **1,260** |
| OLMo-1B | ✅ HF hooks | 1 | **1,260** |
| CRFM GPT-2 Small | ✅ native | **5** | 180 × 7 × 5 | **6,300** |
| SmolLM3-3B | ✅ HF hooks | 1 | **1,260** |
| Qwen2.5-1.5B | ✅ native | 1 | **1,260** |
| **TOTAL** | | | | **~13,860 forward passes** |

### Protocol
1. Extract top-k and bottom-k binding heads at final checkpoint for each model
2. Run ablation under all 4 conditions (A-V1 through A-V4) using `run_causal_c5.py` and `run_causal_c5_olmo.py` patterns
3. Compute mean RecAcc and GenScore per condition
4. Compute specificity = Δ(top) − mean Δ(random)
5. Threshold: specificity > 0.10 = weakly supported; > 0.20 = supported

### Outputs
- `data/results/causal/crfm_gpt2_sm_{seed}_step{N}_causal.json`
- `data/results/causal/smollm3_final_causal.json`
- `data/results/causal/qwen25_1.5b_final_causal.json`

---

## Summary Matrix

```
               T-V1   T-V2   T-V3   T-V4    TOTAL
               (3t)   (9t)   (21t)  (12t)   terms
──────────────────────────────────────────────────
C1-A (between)   ✓     ✓      ✗      ✗       9 terms   ← original test
C1-B (within)    ✓     ✓      ✓      ✓      45 terms   ← redesigned generalisation
C3 few-shot      ✓     ✓      ✗*     ✗*      9 terms
C4-A (between)   ✓     ✓      ✗      ✗       9 terms   ← original test
C4-B (within)    ✓     ✓      ✓      ✓      45 terms   ← redesigned generalisation
C5 ablation      (✓)   ✓      ✓      ✓      41 terms

✗  = original test structurally cannot generalise (between-term variance collapses)
✗* = expandable if C3 few-shot exemplars are authored in a separate track
(✓) = T-V1 pilot terms are a subset of T-V2 Set B; covered implicitly

COHERENCE CORE: T-V2 (9 terms) is tested in ALL experiments (C1-A/B, C3, C4-A/B, C5).
All 45 terms are covered by C1-B, C4-B, and C5.
```

```
                     C1/C4   C3    C5
──────────────────────────────────────
Pythia-160M            ✓      ✓     ✓
Pythia-1B              ✓      ✓     ✓
Pythia-2.8B            ✓      ✓     ✓
OLMo-1B                ✓      ✓     ✓
CRFM GPT-2 Sm (5 sd)   ✓      ✓     ✓
SmolLM3-3B             ✓      ✓     ✓
Qwen2.5-1.5B           ✗      ✓     ✓
```

---

## Computational Budget Estimate

| Phase | Runs / Forward Passes | Notes |
|---|---|---|
| Phase 0 (infrastructure) | — | Tokenization, alignment, canon file |
| Phase 1 C1-A (9-term lifecycle) | 1,440 evaluations | Binding + behavioral, 9 terms × 8 ck |
| Phase 1 C1-B (45-term lifecycle) | +5,760 evaluations | 36 new terms × 8 ck × 2 (adds to C1-A data) |
| Phase 2 C3 | 396 runs | Zero-shot + few-shot at 2 checkpoints |
| Phase 3 C4-A/B | 0 extra | Reuses Phase 1 data; analysis only |
| Phase 4 C5 | ~13,860 forward passes | 7 ablation conditions × 205 prompts |
| **Grand total** | **~21,456** | Across all models, terms, conditions |

---

## Term Canonicalization Decision

The plan uses the following canonical term assignment going forward. All models test the **same terms** to enable direct benchmarking.

**Lifecycle models (C1/C4):** 9-term Set B (T-V2) — *standardized across all models including Pythia, replacing Set A going forward*
> Exception: Pythia Set A (T-V2A) historical results are preserved for continuity; new Pythia runs use Set B.

**C3 models (all):** T-V2 (9 terms) — same as C1/C4, giving full coherence.

**C5 models (all):** T-V2 + T-V3 new terms + T-V4 = 41 terms (Set B as coherence core + tier123 + wave2 for N)
> This is the critical change from the original plan. The 9 C1/C4/C3 terms are now explicitly inside C5, creating a 9-term coherent core with all 4 experiments.

**Identical cross-model benchmarking is guaranteed** once the Phase 0 tokenization audit confirms all 41 terms tokenize as ≥2 tokens in GPT-2 (CRFM), LLaMA-3 (SmolLM3), and Qwen2 (Qwen2.5) tokenizers.

---

## Execution Status

### ✅ Completed (no GPU required)

| Task | Script | Output |
|---|---|---|
| Phase 0a: Pythia 45-term tokenization audit | `src/tokenization_audit.py --pythia-only` | `data/tokenization/tokenization_table_45terms.csv` — all 45 terms ✅ |
| Phase 0a: CRFM/SmolLM3/Qwen tokenizer audit | `src/tokenization_audit.py --new-models-only` | `data/tokenization/tokenization_new_models_45terms.csv` — all 45 terms ✅ |
| Phase 0b: Canonical 41-term prompt register | `src/build_canonical_prompts.py` | `data/prompts/canonical_45terms.jsonl` (41 unique terms, 451 prompts) |
| C1-B analysis script (+ CRFM/SmolLM3 configs) | `src/analyze_c1b_within_term.py` | Per-term lead CSVs + population JSON |
| C4-B analysis script (+ CRFM/SmolLM3 configs) | `src/analyze_c4b_decoupling.py` | Per-term decoupling CSVs + summary JSON |
| CRFM extraction scripts | `src/extract_binding_crfm.py`, `src/eval_behavior_crfm.py` | 5 seeds × 8 ck × 9 terms |
| SmolLM3 extraction scripts + loader | `src/extract_binding_smollm3.py`, `src/eval_behavior_smollm3.py`, `src/utils_model_smollm3.py` | 8 ck × 9 terms |

### ✅ Final Results: C1-B (within-term EB* lead, 41 terms)

| Model | Terms | EB* leads | Lead % | Binomial p | Status |
|---|---|---|---|---|---|
| Pythia-160M | 41 | 3/41 | **7.3%** | 1.000 | ❌ Beh leads at small scale |
| Pythia-1B | 41 | 30/41 | **73.2%** | **0.0022** | ✅ EB* leads |
| Pythia-2.8B | 34 | 27/34 | **79.4%** | **0.0004** | ✅ EB* leads |
| OLMo-1B | 9 | 7/9 | **77.8%** | 0.090 | ✅ EB* leads (N=9, marginal) |
| CRFM GPT-2 Sm | 9 (5-seed maj.) | 8/9 | **89%** | 0.020 | ✅ EB* leads (training-duration effect) |
| SmolLM3-3B | 9 | 3/9 | 33% | 0.910 | ‡ Censored: ck starts step40k, EB* already peaked |

> **Key finding:** Direction reversal confirmed at full 41-term N. At 160M behavioral leads EB* (scale threshold effect). At ≥1B EB* leads behavioral, p<0.01. CRFM (117M, 400k steps) shows 89% EB* lead — training duration not scale drives the transition. SmolLM3 C1-B censored by missing early checkpoints.

### ✅ Final Results: C4-B (within-term decoupling, 28 terms)

| Model | mean rho_early | mean rho_late | Strict decouple | Attenuation |
|---|---|---|---|---|
| Pythia-160M | +0.479 | +0.044 | 13/28 (46%) | 15/28 (54%) |
| Pythia-1B | +0.739 | **−0.054** | 15/28 (54%) | **20/28 (71%)** |
| Pythia-2.8B | +0.613 | +0.270 | 12/28 (43%) | 14/28 (50%) |
| OLMo-1B | +0.490 | −0.348 | 5/8 (62%) | 5/8 (62%) |
| CRFM GPT-2 Sm | +0.488 | +0.261 | 3/9 (33%) | 4/9 (44%) |
| SmolLM3-3B | +0.247 | **−0.189** | **6/9 (67%)** | 7/9 (78%) |

> **Interpretation:** 1B is the peak decoupling point (lowest rho_late, highest attenuation). 2.8B remains coupled at late checkpoints (rho_late=+0.270), which alongside C5 (where 2.8B has lowest RecDrop) suggests distributed representations at large scale. C5 canonical41 remains primary causal evidence.

### ✅ Final Results: C5 Canonical41

| Model | RecDrop | GenDrop | Specificity | Support |
|---|---|---|---|---|
| Pythia-160M | +0.112 | +0.068 | +0.091 | ⚠️ Weakly |
| Pythia-1B | **+0.151** | +0.055 | +0.084 | ⚠️ Weakly |
| Pythia-2.8B | +0.073 | +0.047 | +0.079 | ⚠️ Weakly |

> 1B has the largest RecDrop — binding heads are causally most necessary at intermediate scale. Specificity decreases with model size (binding heads become one of many routes at 2.8B). All models just below the 0.10 threshold; top heads all concentrated in early layers at 2.8B (L1 dominant).

### 🔲 GPU Run Queue (requires CUDA)

```bash
# Phase 1b: CRFM GPT-2 Small lifecycle (5 seeds × 8 ck × 9 terms)
python src/extract_binding_crfm.py --all   # all seeds × all checkpoints
python src/eval_behavior_crfm.py --all

# After CRFM extraction: re-run C1-B/C4-B to include CRFM
python src/analyze_c1b_within_term.py --model crfm
python src/analyze_c4b_decoupling.py --model crfm

# Phase 1c: SmolLM3-3B lifecycle (8 ck × 9 terms)
# Note: verify HuggingFaceTB/SmolLM3-3B-checkpoints stage1-step-{N} branches load correctly first
python src/extract_binding_smollm3.py --probe
python src/extract_binding_smollm3.py --all
python src/eval_behavior_smollm3.py --all

# Phase 4b/c/d: C5 for new models
# python src/run_c5_canonical.py --model crfm --seed all  (after writing CRFM C5 wrapper)
# python src/run_c5_canonical.py --model smollm3
# python src/run_c5_canonical.py --model qwen
```

### Execution Order

```
Phase 0a  ✅ Pythia tokenization audit (45 terms) — DONE
Phase 0a' ✅ CRFM/SmolLM3/Qwen tokenizer audit (45 terms) — DONE (all ✅)
Phase 0b  ✅ Canonical prompt register (41 terms) — DONE

Phase 1   ✅ C1-A/B data for Pythia + OLMo (41 terms) — DONE
Phase 1-W ✅ Wave-2 extraction (12 new terms, 3 Pythia × 8 ck) — DONE
Phase 1b  � CRFM GPT-2 Sm seed 1 — RUNNING (binding + behavioral)
Phase 1b  🔲 CRFM GPT-2 Sm seeds 2–5 — queued
Phase 1c  🔲 SmolLM3-3B (8 ck × 9 terms) — scripts ready, pending run

Phase 3   ✅ C4-A/B analysis on Pythia + OLMo — DONE
Phase 3'  🔲 C4-B CRFM + SmolLM3 — after Phase 1b/1c

Phase 2   🔲 C3 — CRFM, SmolLM3, Qwen2.5 — GPU required

Phase 4a  ✅ C5 canonical41 Pythia 3 models — DONE
Phase 4b  🔲 C5 CRFM GPT-2 Sm (5 seeds) — after Phase 1b
Phase 4c  🔲 C5 SmolLM3-3B — after Phase 1c
Phase 4d  🔲 C5 Qwen2.5-1.5B — standalone (no lifecycle needed)

Phase 5   🔲 Paper integration (results.md + appendix update with final numbers)
```

---

## Paper Narrative: The Three-Act Structure

This section documents the intended narrative arc in `paper/sections/results.md` and `methods.md` to justify the analytical evolution from 9 terms to 45 terms without making the paper feel ad hoc.

---

### Act I — Pilot Validation (3 → 9 terms, §3.2 Methods / §4.1–4.3 Results)

**What to write:**
> We begin with three canonical web accessibility terms (screen reader, skip link, alt text) chosen to span a range of token frequencies, semantic compositionality, and WCAG criticality. These establish feasibility: EB* is measurable, behavioral prompts produce reliable scores, and the lifecycle curve is visible at the term level. We then expand to 9 terms (Set B) to test whether the between-term Spearman structure holds across a diverse pilot set. The 9 terms were selected to maximise EB* variance across the set — spanning high-binding terms (screen reader, keyboard navigation) and lower-binding terms (landmark region) — which is a prerequisite of the between-term Spearman test. C1/C4 Spearman on these 9 terms yields ρ_early = +0.X (coupling) and ρ_late = −0.X (decoupling) across Pythia scales and OLMo.

**Why this framing works:** It makes the 9-term selection look deliberate (EB* variance maximisation), not arbitrary. The small N is a feature of the test design, not a limitation.

---

### Act II — Scaling Failure and Redesign (9 → 21 → 45 terms, §3.2 Methods / §4.4 Results)

**What to write:**
> To assess generalisability, we expanded to 21 tier-1/2/3 terms (§3.2). Contrary to expectations, the between-term Spearman collapsed to ρ ≈ 0.0–0.21 throughout training. Analysis revealed the cause: the 21 new terms cluster tightly in EB* (range 0.60–0.77, SD = 0.05), providing insufficient between-term variance for the cross-sectional test. This is not evidence against the lifecycle — it is evidence that the between-term Spearman is a *variance-sensitive* test whose power depends on term selection. The lifecycle may hold within each individual term but cannot be detected by a test that requires spread across terms.
>
> This motivated a redesign of the C1/C4 analysis from a between-term cross-sectional test to a within-term temporal precedence test (§3.X). The redesigned test asks, for each term independently: does EB* at checkpoint k predict behavioral score at k+1 better than the reverse? This test has no between-term variance requirement and scales to any number of terms.

**Why this framing works:** The failure on 21 terms becomes a *discovery* — it reveals the statistical dependency of the original test and motivates a more general methodology. Reviewers will see it as intellectual honesty and methodological rigour, not a failed replication.

---

### Act III — Generalisation (45 terms, §4.5 Results / §5 Discussion)

**What to write:**
> Applying the within-term temporal precedence test to all 45 terms across [N] models, we find EB* leads behavioral emergence in Y% of terms (binomial p = Z). This result is robust across model scales, architectures (GPT-NeoX, GPT-2, LLaMA-3, Qwen2), and training data compositions (Pile, C4, multilingual, 18T-token). The fraction of terms showing decoupling in C4-B increases with model scale — 160M: X%, 1B: Y%, 2.8B: Z% — confirming that the lifecycle is not a small-model artefact. C5 causal ablation on the same 41-term pool confirms that the heads identified by EB* are causally necessary at 160M and causally decoupled at 2.8B, directly bridging the correlational lifecycle evidence with a mechanistic causal claim.

**Why this framing works:** The three-act structure shows a principled research trajectory:
1. Pilot → small-scale validation of the mechanism
2. Expansion → discovery of the test's statistical limits, motivating redesign
3. Generalisation → redesigned test applied at scale, broader claims supported

This is the narrative of iterative scientific refinement — not a series of failed replications.

---

### Key Sentences to Place in `methods.md §3.2`

```
We report two variants of the C1/C4 lifecycle test:

- C1-A/C4-A (between-term Spearman): applied to the 9-term pilot set (T-V2),
  which was selected to span EB* profiles. Requires between-term EB* variance.

- C1-B/C4-B (within-term temporal precedence): applied to all 45 terms.
  For each term t, we test whether EB*(t, k) predicts Beh(t, k+1) better than
  the reverse using a 1-step forward lag correlation. The population-level
  claim (H1: lead fraction > 0.5) is tested with a binomial test across 45 terms.
  This test does not require between-term EB* variance and generalises to
  any term set.

The two variants are complementary: C1-A/C4-A provides a high-contrast
demonstration of the lifecycle pattern on a carefully selected pilot set;
C1-B/C4-B provides a term-agnostic validity check at scale.
```
