# 5. Discussion

## 5.1 Summary of Findings

This study introduces attention-head binding (EB\*) as a mechanistic interpretability metric and validates it through discriminant validity analysis against token co-occurrence baselines. Applying EB\* longitudinally across three model scales and nine accessibility terms, we uncover a *representational lifecycle*: binding-behavior correlation transitions from strong positive coupling to negative correlation as models mature. Our four principal findings are:

1. **Coupling-decoupling lifecycle.** The binding-behavior relationship undergoes a systematic phase transition during training. At early checkpoints (steps 15–30K), binding and behavior are strongly positively correlated (ρ = +0.57, p < 0.001, n=108 pairs), establishing EB\* as a meaningful early-stage acquisition signal. At trained checkpoints (steps 120–143K), this relationship reverses to negative correlation (ρ = −0.20, p = 0.01, n=162 pairs), indicating representational reorganization toward distributed mechanisms that decouple from token-pair binding (Figure 1). The phase transition is visible as a reversal from above-diagonal clustering (positive correlation) to below-diagonal scattering (negative/no correlation) in Figure 2.

2. **Discriminant validity confirmed.** Real accessibility terms (mean EB\* = 0.74) show significantly stronger binding than three carefully designed controls: rare token pairs (0.50), cross-language mixing (0.41), and true nonsense (0.26), all p < 0.001. This graded ordering indicates that EB\* captures meaningful conceptual coherence, not merely token adjacency. Term-level heterogeneity (for example, "aria attribute" shows EB\* = 0.42 yet high behavioral competence) suggests that EB\* measures a specific attention mechanism distinct from general semantic knowledge. Figure 3 illustrates this independence: EB\* trajectories saturate high for most terms while behavioral trajectories diverge widely.

3. **Scale-dependent transition dynamics.** The lifecycle follows a non-monotonic pattern across scale and training duration. Pythia-160M maintains rising EB\* and positive late-window correlation (ρ_late=+0.044). Pythia-1B shows genuine decoupling (ρ_late=−0.054, 54% strict). Pythia-2.8B shows persistent positive correlation (ρ_late=+0.270), interpreted as a partial ceiling effect combined with genuine residual coupling (§5.2). SmolLM3-3B (3440k steps) shows the deepest decoupling in the 41-term dataset (ρ_late=−0.281, 55% strict), and OLMo-1B the strongest temporal precedence (90% EB\*-leads, p<0.0001). These results are consistent with a two-factor model: a parameter threshold near 1B governs decoupling depth, and a training-step threshold near 300k governs temporal ordering (Figure 10). The forest plot shows how models cluster: OLMo-1B and Pythia-1B/2.8B achieve >70% EB\*-leads fraction (Wilson 95% CI well above chance), while Pythia-160M falls at 7% and CRFM shows high seed-to-seed variance.

4. **Causal validation of decoupling.** Targeted ablation of high-binding heads across seven models on the canonical 41-term dataset (N=205 prompts) reveals a causal trajectory graded by scale and training maturity (Figure 12). At Pythia-1B, ablation impairs recognition by 15.1 pp — the largest drop — confirming binding heads are maximally load-bearing at the transitional regime. Pythia-160M shows 11.2 pp impairment; Pythia-2.8B shows a redundancy-plus-specificity signature (random ablation helps +3.7 pp, yet top-binding heads are 11.0 pp worse than random). OLMo-1B (step 143k) and Qwen2.5-1.5B (final) both achieve a **99% recognition ceiling**, yielding near-zero specificity (−0.006 and +0.005), representing the terminal consolidated phase. SmolLM3-3B (step 3440k) achieves 86.8% baseline with slightly negative specificity (−0.043), confirming the ceiling-adjacent distributed regime. For CRFM GPT-2 Small (ck-400k), **4/5 seeds show the coupled pattern** (top-ablation impairs recognition); the modal result is coupling. However, one seed (alias-x21) shows a striking suppressor pattern (+20.9 pp, spec=−0.175) — an anomaly rather than a general finding, but notable because seed-to-seed variance at 117M scale (spec SD=±0.152, baseline range 0.600–0.966) is far larger than at 1B, indicating that small models are particularly sensitive to random initialization in determining the causal role of binding heads. Discriminant validity holds across all Pythia scales; at 1B it is ordinal rather than categorical (§4.5.2).
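
The Wilson intervals quoted for the EB\*-leads fractions can be reproduced from a count and a total. The sketch below is a minimal, standard-library illustration rather than the analysis code behind Figure 10; the function name `wilson_interval` is hypothetical, and the 3-of-41 count is the Pythia-160M figure reported in §5.4.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 3 of 41 terms show EB*-leads ordering at Pythia-160M (count taken from the text).
lo, hi = wilson_interval(3, 41)
print(f"EB*-leads fraction = {3 / 41:.3f}, 95% Wilson CI = [{lo:.3f}, {hi:.3f}]")
```

For 3 of 41 this gives an interval of roughly 0.03 to 0.19, which lies entirely below the 0.5 chance level.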

## 5.2 Mechanistic Interpretation

### Interpreting the Representational Lifecycle

The coupling→decoupling transition reveals a **developmental trajectory** across scale and training duration, with CRFM demonstrating high initialization sensitivity at small scale that modulates the degree of coupling but does not define a deterministic causal boundary:

**Phase 1: Coupling (early training, steps 0–30K).** Models initially rely on explicit token-pair binding to organize multi-token concepts. Strong positive correlation (ρ = +0.57, Figure 2 top row) indicates that developing binding structure directly supports behavioral competence. At this stage, EB\* serves as a predictive early-warning signal: models with high binding at step 15K tend to develop behavioral competence in subsequent training, consistent with binding acting as a structural precondition for knowledge expression.

**Phase 2: Transition (middle training, steps 60–90K).** Correlation weakens (ρ = +0.14, ns) as models begin developing alternative representational pathways. Binding structure remains elevated but no longer correlates strongly with behavioral improvements, indicating the emergence of distributed mechanisms that can support task execution independently of token-pair attention.

**Phase 3: Decoupling (late training, steps 120–143K).** Correlation weakens or reverses (ρ_late=−0.054 at 1B; +0.270 at 2.8B, reduced to +0.205 after excluding 4 behavioral-ceiling terms, which indicates a partial ceiling effect plus genuine residual coupling; see §5.1 scale-dependent analysis). Terms with higher binding begin to show relatively lower performance at trained checkpoints, as distributed representations take over (Figure 2 bottom row, Figure 4). The ablation evidence at canonical41 scale shows scale-graded causal necessity: 1B binding heads are most causally load-bearing (−15.1 pp), whereas 2.8B shows a redundancy signature (random ablation helps; top-binding heads are specifically but moderately harmful). This graduated pattern is more consistent with a *distributed takeover* than with a binary coupling→vestigial reversal.
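
Each window's ρ can be recomputed from paired per-term (EB\*, behavior) values. The following sketch assumes ρ denotes Spearman's rank correlation, which this section does not state explicitly. The helper names are hypothetical and the input values are made up purely to illustrate a positive early window and a negative late window; they are not data from this study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Illustrative (invented) per-term values, not measurements from the paper.
early_eb, early_beh = [0.31, 0.45, 0.52, 0.66, 0.71], [0.10, 0.22, 0.20, 0.41, 0.50]
late_eb, late_beh = [0.80, 0.84, 0.88, 0.91, 0.95], [0.90, 0.72, 0.85, 0.70, 0.66]

print(f"rho_early = {spearman_rho(early_eb, early_beh):+.2f}")
print(f"rho_late  = {spearman_rho(late_eb, late_beh):+.2f}")
```

With these toy inputs the early window comes out at about +0.9 and the late window at about −0.9, mirroring the sign change described above in exaggerated form.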

**Scale-Dependent Lifecycle Dynamics:**

- **160M (partial coupling):** Limited capacity maintains binding heads as load-bearing throughout training. Late-window correlation remains positive (ρ_late=+0.044), and canonical41 ablation impairs recognition accuracy by 11.2 pp (spec=+0.091), confirming moderate but consistent coupling.

- **1B (transitional regime):** Pythia-1B shows genuine decoupling (ρ_late=−0.054, 54% strict decouple) with binding heads remaining maximally causally load-bearing (C5: −15.1 pp). This co-occurrence of statistical decoupling and strong causal coupling reflects a transitional state: EB\* and behavior have diverged as measurable trajectories, but the specific binding circuitry has not yet been functionally superseded.

- **2.8B (slow-decoupling / redundancy regime):** Pythia-2.8B shows persistent positive late-window correlation (ρ_late=+0.270). A **sub-ceiling sensitivity analysis**, restricting the per-term average to the 24 of 28 terms with mean late-window behavioral score < 0.80, reduces ρ_late from +0.270 to **+0.205**, a 24% reduction. The 4 excluded ceiling terms (*high contrast*, *keyboard shortcut*, *skip link*, *tree grid*; beh 0.80–0.86) have disproportionately high ρ_late (mean +0.66), confirming a partial ceiling-effect contribution. However, 12 of the 24 sub-ceiling terms still show ρ_late ≥ +0.30 (with *input purpose*, *screen reader*, *switch access* each reaching +1.0), so the positive correlation is **not purely an artifact**: genuine residual coupling persists for a subset of terms at 143k steps. The C5 causal signature is nonetheless consistent with a partially distributed regime: top-head ablation at 2.8B produces a mild *redundancy* effect (specificity=+0.110 rec-only; random baseline shifts upward), not the tight load-bearing pattern of 160M or 1B. The combined picture (partial ceiling effect, genuine residual coupling, mild causal specificity) is most consistent with 2.8B being in an **intermediate slow-decoupling stage**: the full decoupling cycle has not yet completed at 143k steps (~286B tokens), and would likely produce deeper negative ρ_late at longer training (as seen in SmolLM3 at 3440k steps).

- **SmolLM3-3B (ceiling-adjacent regime):** Achieves 86.8% baseline recognition (below 99% ceiling but well above chance). Top-binding head ablation yields +3.4 pp (slight improvement), rand −0.9 pp, specificity −0.043. This near-zero negative pattern places SmolLM3 in the same distributed regime as OLMo and Qwen despite its lower absolute accuracy, consistent with its deep late-window decoupling (ρ_late=−0.281, the deepest in the dataset) and its 3440k training steps.

- **OLMo-1B and Qwen2.5-1.5B (ceiling regime):** Both reach 99% recognition accuracy at final checkpoints, leaving at most 2/205 prompts that any ablation can flip. Top-binding head ablation (−1.0 pp), random ablation (−0.5 to −1.6 pp), and bottom ablation (≈0 pp) are statistically indistinguishable, yielding specificity ≈0 (−0.006 and +0.005). This is the terminal consolidated phase: accessibility knowledge is distributed across so many heads that no local 4-head intervention can detectably impair it. Binding heads remain anatomically identifiable via BSI but are causally dispensable.

- **CRFM GPT-2 Small at ck-400k (seed-dependent coupling at small scale):** Across 5 seeds, **4/5 show the coupled pattern** (top-ablation impairs recognition, spec > 0); seed 1 (alias-x21) is the sole suppressor (spec=−0.175, +20.9 pp). The modal causal regime is therefore **coupled**: the primary finding is that CRFM at 117M / 400k steps typically develops load-bearing binding heads, consistent with the 160M–1B region of the lifecycle. However, the high inter-seed variance (SD=±0.152, 5-seed CI roughly −0.07 to +0.23) and dramatically varying baselines (0.600–0.966) reveal strong **initialization sensitivity**: the specific random seed determines both performance level and the *degree* of coupling. Seed 1 is better understood as an anomalous suppressor outlier than as evidence that CRFM straddles a deterministic causal boundary — with n=5 seeds, a single outlier is plausible noise. The practically important finding is that small-scale models show far greater seed-to-seed variance in causal function than 1B-scale models, with CRFM's C4-B decoupling variance (22–73% strict decouple across seeds) reflecting the same initialization sensitivity in the correlation-based lifecycle analysis.
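
The 2.8B sub-ceiling numbers reported above can be checked for internal consistency with simple arithmetic. Assuming ρ_late is an unweighted mean of per-term values (our reading of "per-term average"), pooling the 24 sub-ceiling terms (mean +0.205) with the 4 ceiling terms (mean +0.66) should recover the full-sample +0.270, and the relative drop should be about 24%. The function name `pooled_mean` is hypothetical.

```python
def pooled_mean(groups):
    """Mean over all terms, given (n_terms, group_mean) pairs."""
    total_n = sum(n for n, _ in groups)
    if total_n == 0:
        raise ValueError("at least one term is required")
    return sum(n * m for n, m in groups) / total_n


# Group sizes and means as reported for Pythia-2.8B in the text.
sub_ceiling = (24, 0.205)  # terms with mean late-window behavioral score < 0.80
ceiling = (4, 0.66)        # excluded behavioral-ceiling terms

rho_all = pooled_mean([sub_ceiling, ceiling])
relative_drop = (rho_all - sub_ceiling[1]) / rho_all

print(f"pooled rho_late = {rho_all:+.3f}")
print(f"relative reduction after exclusion = {relative_drop:.1%}")
```

The pooled value matches the reported +0.270, which suggests that the three figures (+0.270, +0.205, +0.66) are mutually consistent under this averaging assumption.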

This lifecycle parallels developmental neuroscience observations where early structural scaffolding becomes inhibitory as sophisticated processing develops (Huttenlocher, 2002), and machine learning findings on the "lottery ticket" phenomenon where structures optimal early in training persist despite becoming suboptimal at convergence (Frankle & Carbin, 2019).

### Unlockability as Evidence of Complete Latent Representations

The magnitude of the unlockability effect (+30 pp at 160M step 15k in the 9-term dataset) suggests that binding structure at EB\* > 0.6 represents not partial but *complete* conceptual knowledge that is simply inaccessible to standard prompting. All three tested checkpoints converge to near-identical few-shot performance (0.944), regardless of their zero-shot baselines (0.333–0.667). This ceiling convergence implies that the underlying representations are equivalently rich; differences in zero-shot behavior reflect activation failures, not knowledge gaps [Burns et al., 2022]. This parallels findings in "grokking" [Power et al., 2022], where circuits form before behavioral expression, but operates at the representational rather than algorithmic level.

### Why Top-Binding Heads Are Scale-Specifically Harmful

The 2.8B top-binding heads are concentrated in layers 1 and 4, much earlier than the 160M's distributed pattern (layers 0, 2, 3). In deep transformer networks, early layers typically encode local, syntactic features while later layers develop semantic and task-relevant representations [Tenney et al., 2019; Hewitt & Manning, 2019]. At 2.8B, the early-layer binding heads may "lock in" rigid token associations before later layers can contextually modulate them—effectively creating an attention bottleneck that constrains rather than supports flexible inference.

### Interpreting the 2.8B Redundancy-Plus-Specificity Pattern

The canonical41 C5 result at 2.8B — random ablation helps (+3.7 pp), top-binding heads specifically harmful (−11.0 pp vs random) — admits three interpretations: (a) heads in general serve as attention sinks at this scale, with binding heads being the strongest sinks; (b) the model has developed sufficient distributed capacity that any localized computation can be absorbed by the remaining heads, but binding heads are the most computation-specific; (c) genuine functional supersession where distributed representations have subsumed binding-specific computation, leaving binding heads as the most structurally conspicuous ‘residuals’. The discriminant validity result (bottom-4 ablation ≈ 0 effect at all scales) argues against (a) — if all heads were sinks, bottom ablation would help too. The canonical41 pattern is most consistent with (b)/(c): distributed redundancy with binding-head specificity. The 3-term pilot result (N=6, apparent +33.3 pp improvement from top ablation) was a small-sample artifact; the N=205 canonical41 result is authoritative.
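
The specificity values quoted in this section are numerically consistent with a simple difference: the accuracy change under random-head ablation minus the change under top-binding-head ablation, expressed as a fraction. This definition is inferred from the reported numbers rather than taken from the methods, so the sketch below should be read as an assumption; the function name is hypothetical.

```python
def specificity(delta_top_pp: float, delta_random_pp: float) -> float:
    """Random-minus-top change in recognition accuracy, as a fraction.

    Inputs are accuracy changes in percentage points relative to the
    unablated baseline. A positive result means ablating top-binding heads
    hurts more than ablating random heads (assumed definition).
    """
    return (delta_random_pp - delta_top_pp) / 100.0


# Pythia-2.8B: random ablation +3.7 pp; top-binding heads 11.0 pp worse than random.
delta_random_28b = 3.7
delta_top_28b = delta_random_28b - 11.0
print(f"2.8B specificity    = {specificity(delta_top_28b, delta_random_28b):+.3f}")

# SmolLM3-3B: top ablation +3.4 pp, random ablation -0.9 pp.
print(f"SmolLM3 specificity = {specificity(3.4, -0.9):+.3f}")
```

Under this reading the two cases reproduce the +0.110 and −0.043 reported for Pythia-2.8B and SmolLM3-3B respectively.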

## 5.3 Implications

### For Mechanistic Interpretability

Our findings caution against assuming that high activation of a mechanistic feature implies positive causal contribution. The cross-scale reversal demonstrates that the same internal structure can play opposite functional roles depending on model capacity and training stage. Interpretability methods that rely on correlation between internal features and behavior may miss—or mischaracterize—these scale-dependent dynamics.

### For Model Development

The decoupling effect suggests that monitoring internal mechanistic markers alongside behavioral benchmarks could reveal when models are developing potentially problematic internal strategies. A model that achieves high behavioral performance despite superseded binding structure may be more fragile than one where binding and behavior are aligned.

### For Accessibility AI

The finding that accessibility concepts undergo complex developmental trajectories in language models has practical implications. Models deployed for accessibility-related tasks should be evaluated not just on behavioral accuracy but on the robustness of their internal representations—particularly at scale, where high performance may mask unstable or conflict-laden internal structure.

## 5.4 Limitations

**Evaluation scale.** We have substantially addressed this limitation through two expansions: first to 9 terms (99 prompts), then to 41 canonical accessibility terms (451 prompts spanning 5 models). The 41-term expansion confirms C1-B for OLMo-1B at 90% (p < 0.0001) and CRFM at 72.7% combined (p ≪ 0.001), providing strong statistical support for the temporal precedence claim. The dataset remains modest relative to full-scale benchmarks, but specific numerical values (correlation coefficients, accuracy drops) have been replicated across Pythia (3 scales), OLMo-1B, CRFM GPT-2 Small (5 seeds), SmolLM3-3B, and Qwen2.5-1.5B. Our term selection focused on common web accessibility concepts; broader coverage of specialized accessibility domains (assistive technologies for motor impairments, cognitive accessibility, etc.) would strengthen generalizability claims.

**Domain specificity.** This study examines web accessibility terminology exclusively. While our V3/V4 control experiments (§4.0) provide indirect evidence of cross-domain mechanisms—wrong-domain terms pairing accessibility tokens with programming ("alt function", "skip variable") or hardware ("screen printer", "screen monitor") terms show discriminant validity patterns consistent with corpus co-occurrence effects rather than accessibility-specific processing—direct replication across diverse technical domains would strengthen generalizability claims.

**Generic phrase baseline.** Our controls test semantic invalidity (V2 nonsense) and domain-crossing (V3/V4 wrong-domain), but do not include generic non-technical multi-word phrases (e.g., "big dog", "red car"). Such phrases might show intermediate EB\* if binding reflects general compositional processing rather than technical concept acquisition specifically. Testing this would distinguish domain-specific concept binding from generic compositional attention patterns. However, the V3/V4 results showing high binding even for semantically irrelevant cross-domain pairs at 160M (EB\* = 0.86) suggest binding responds primarily to corpus co-occurrence rather than conceptual coherence, supporting the interpretation that EB\* measures statistical binding mechanisms applicable to any multi-token sequence.

We selected accessibility as the initial domain because: (1) multi-token terms require compositional binding ("screen" + "reader" → assistive technology, not "screen door" or "PDF reader"), (2) clear technical definitions enable ground-truth evaluation through multiple-choice and keyword-based scoring, (3) domain-specific vocabulary tests genuine concept learning versus general language patterns, and (4) accessibility terminology remains underexplored in mechanistic interpretability literature despite its practical importance. The V3/V4 results demonstrating that even semantically irrelevant cross-domain pairs ("color syntax", "landmark class") show high binding at 160M due to corpus co-occurrence suggest the binding mechanism operates on statistical co-occurrence patterns rather than domain-specific semantic processing, supporting domain-generality.

However, establishing whether the coupling→decoupling lifecycle is a universal representational transition or exhibits domain-dependent dynamics remains an important open question. Ongoing work is replicating these experiments across programming concepts (e.g., "API endpoint", "merge conflict", "stack overflow"), medical terminology (e.g., "blood pressure", "immune system", "white blood cell"), and potentially legal/scientific domains to test cross-domain consistency. Cross-domain control experiments pairing tokens from different domains (e.g., "API pressure", "blood endpoint") would further distinguish semantic binding from arbitrary token adjacency. Additionally, investigating whether domain semantics—anatomical versus procedural versus abstract concepts—systematically influence binding dynamics could reveal representational specialization or domain-specific learning trajectories within the attention mechanism.

**Ablation granularity.** Zero-ablation of attention patterns is a coarse intervention. More targeted techniques—activation patching, path patching, or causal scrubbing—could provide finer-grained understanding of how binding heads contribute to or interfere with computation.

**Layer-level analysis.** While EB\* aggregates across layers (max EB across all layers), we do not systematically analyze which layers contribute binding at different training stages. Our ablation results (§4.5) reveal that 2.8B concentrates high-binding heads in early layers (L1, L4) while 160M shows distributed binding (L0-L8), and we interpret this pattern in §5.2. However, a comprehensive layer-by-layer developmental analysis tracking how binding migrates across layers during training could reveal additional architectural reorganization dynamics. Such analysis would complement our aggregate EB\* metric by exposing layer-specific specialization and redistribution patterns.
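
As a starting point for such a layer-level analysis, the aggregate metric and the layer locations of the strongest heads can be read off the same score table. The sketch below assumes binding scores are available as a layers × heads nested list, which is an assumption about the data layout rather than something specified here; the function names and the toy scores are hypothetical, and `k=4` simply mirrors the 4-head ablations discussed in §5.2.

```python
from collections import Counter


def eb_star(scores):
    """Aggregate binding: the maximum score over all (layer, head) entries."""
    return max(max(layer_scores) for layer_scores in scores)


def top_heads(scores, k=4):
    """Return the k highest-scoring heads as (layer, head, score) triples."""
    flat = [
        (score, layer_idx, head_idx)
        for layer_idx, layer_scores in enumerate(scores)
        for head_idx, score in enumerate(layer_scores)
    ]
    flat.sort(reverse=True)
    return [(layer_idx, head_idx, score) for score, layer_idx, head_idx in flat[:k]]


def layer_histogram(scores, k=4):
    """Count how many of the top-k heads fall in each layer."""
    return Counter(layer_idx for layer_idx, _, _ in top_heads(scores, k))


# Toy 6-layer x 3-head score table (invented values, not measured data).
toy_scores = [
    [0.10, 0.20, 0.15],
    [0.90, 0.85, 0.30],
    [0.25, 0.20, 0.10],
    [0.30, 0.35, 0.20],
    [0.80, 0.75, 0.40],
    [0.10, 0.05, 0.20],
]

print("EB* =", eb_star(toy_scores))
print("top heads:", top_heads(toy_scores))
print("heads per layer:", dict(layer_histogram(toy_scores)))
```

Running the same histogram at each checkpoint would show whether the top-k heads migrate between layers over training.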

**Model family.** Initial experiments used the Pythia suite exclusively. We have since replicated across four additional architectures and training pipelines: OLMo-1B (AllenAI, decoder-only), CRFM GPT-2 Small (5 independent seeds; identical architecture, different initialization), SmolLM3-3B (HuggingFaceTB, LLaMA-3 architecture), and Qwen2.5-1.5B (Alibaba). The C1-B temporal precedence finding replicates in OLMo (90%) and CRFM (mean 73%); SmolLM3 is left-censored (its earliest checkpoint, step-40k, is already past the coupling phase). Together these results substantially strengthen the generalizability claim beyond a single model family. Further replication on instruction-tuned or RLHF-fine-tuned models remains for future work.

**Temporal precedence at 160M.** Pythia-160M, the primary lifecycle model with the richest checkpoint coverage, shows non-confirmatory C1-B results (3/41 terms, 7%, effectively anti-leads) due to insufficient training duration to cross the ~300k-step threshold. The coupling evidence for 160M therefore rests entirely on synchronous cross-term correlation (ρ_early=+0.53), not temporal precedence. The temporal precedence claim is supported by 1B/2.8B Pythia, OLMo-1B, and CRFM, but not by the primary lifecycle model used in the main C1 analysis.
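
To gauge how far 3 of 41 lies from a coin-flip ordering, an exact binomial tail probability can be computed with the standard library. This is a sketch under the assumptions of a null EB\*-leads rate of 0.5 and independent terms; it is not necessarily the test used for the C1-B p-values reported elsewhere in the paper, and the function name is hypothetical.

```python
from math import comb


def binom_tail(k: int, n: int, p: float = 0.5, upper: bool = True) -> float:
    """Exact P(X >= k) if upper, else P(X <= k), for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in ks)


# Pythia-160M: 3 of 41 terms show EB*-leads ordering (count from the text).
p_lower = binom_tail(3, 41, upper=False)
print(f"P(X <= 3 | n=41, p=0.5) = {p_lower:.2e}")
```

A lower-tail probability this small is what motivates describing the 160M result as "effectively anti-leads" rather than merely non-significant.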

**Stability (C2).** We explicitly did not test Claim C2, which posits that binding structure in mid-to-late layers exhibits greater stability across prompt perturbations than early-layer binding. While our results are consistent across multiple prompts per term, formal stability analysis—varying phrasing, word order, or context—remains for future work. This omission limits our ability to assert that EB\* captures robust conceptual representations rather than prompt-specific attention patterns.

**Few-shot interpretation.** Few-shot gains, particularly the +61 pp pilot result, are partly attributable to in-context copying (§4.3). The pilot number should not be taken at face value, as it uses a 3-term rubric where exemplar phrasing covers nearly all scorable keywords. The unified-protocol results (+19–36 pp) are the authoritative estimates; the copying concern is mitigated there by three partial controls (term-level variation including zero-scoring `landmark region`, step-0 baseline, and SmolLM3 regression), documented in §4.3. The copying fraction cannot be precisely quantified without paraphrased exemplars, a definitive test deferred to future work. The most conservative interpretation is that few-shot prompting reveals *accessible* latent knowledge; distinguishing accessibility from genuine retrieval requires counterfactual probe designs.

## 5.5 Future Directions

1. **Prompt stability (C2).** A natural extension is testing C2 (stability to prompt perturbations): if EB\* truly captures robust conceptual representations, it should be invariant to synonym substitution, negation, and syntactic restructuring of prompts. Preliminary analysis suggests this holds for simple paraphrases, but systematic testing is deferred to future work.

2. **Cross-domain validation.** Ongoing work applies EB\* to programming (9 terms: API endpoint, merge conflict, stack overflow, etc.) and medical terminology (9 terms: blood pressure, immune system, white blood cell, etc.) to test whether the coupling→decoupling lifecycle replicates across domains. Preliminary experiments suggest the pattern holds, but domain-specific variance (e.g., anatomical terms showing higher early coupling) may reveal representational specialization. Cross-domain controls pairing tokens from different domains (e.g., "API pressure") would test semantic binding versus statistical co-occurrence more definitively than within-domain V3/V4 controls.

3. **Fine-grained causal analysis.** Use activation patching and circuit-level analysis to map the complete computational pathways involving binding heads at each scale.

4. **Training intervention.** Test whether artificially strengthening or weakening binding heads during training affects behavioral acquisition, enabling true causal claims about the developmental role of binding.

5. **Instruction-tuned models.** Examine whether instruction tuning realigns binding and behavior at scales where they have decoupled, potentially recovering the coupled regime.

6. **Binding as a monitoring tool.** Develop EB\* as a real-time training diagnostic that flags when binding-behavior decoupling begins, potentially signaling representational instability.
