# 1. Introduction

Understanding how language models acquire and represent domain-specific knowledge is a central challenge in mechanistic interpretability. While behavioral evaluations reveal *what* a model knows, they provide limited insight into *how* and *when* internal representations form during training. This gap is particularly consequential for safety-critical domains such as web accessibility, where models are increasingly deployed to generate code, content, and recommendations that affect users with disabilities.

We address this gap by introducing *attention-head binding* (EB\*), a mechanistic metric that quantifies how strongly individual attention heads bind the constituent tokens of multi-token technical terms—such as "screen reader," "skip link," and "alt text"—into coherent conceptual units. Our central hypothesis is that this binding signal serves as an early, internal marker of concept acquisition that precedes externally observable behavioral competence.
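To make the idea concrete, the following minimal sketch shows one way a per-head binding score could be read off an attention tensor for a two-token term. It assumes, purely for illustration, that binding is the attention weight the term's final token places on its first token; the formal definition of EB\* is given in §3 and may differ. The function name `head_binding`, the array layout, and the toy attention values are all hypothetical.

```python
import numpy as np

def head_binding(attn, first_idx, last_idx):
    """Per-head attention from a term's last token to its first token.

    attn: array of shape (n_heads, seq_len, seq_len), where attn[h, q, k]
    is the weight query position q places on key position k in head h.
    Illustrative proxy only, not the paper's formal EB* definition.
    """
    attn = np.asarray(attn, dtype=float)
    if not (0 <= first_idx < last_idx < attn.shape[1]):
        raise ValueError("expected 0 <= first_idx < last_idx < seq_len")
    return attn[:, last_idx, first_idx]

# Synthetic causal attention for 2 heads over 3 tokens: ["a", "screen", "reader"].
toy_attn = np.array([
    [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0],
     [0.1, 0.8, 0.1]],   # head 0: "reader" attends mostly to "screen"
    [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0],
     [0.6, 0.2, 0.2]],   # head 1: "reader" attends mostly to "a"
])

scores = head_binding(toy_attn, first_idx=1, last_idx=2)
best_head = int(np.argmax(scores))  # index of the head with the strongest binding
print(scores, best_head)
```

In practice the attention array would come from a forward pass over a prompt containing the term. How scores are aggregated across heads and layers is a design choice this sketch leaves open.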

We study the Pythia model suite (EleutherAI; 160M, 1B, and 2.8B parameters) across eight training checkpoints spanning the full training trajectory (step 0 through step 143,000). To test whether the binding-behavior lifecycle generalizes beyond a single model family, we replicate across four additional architectures: OLMo-1B (AllenAI, Dolma-trained), Stanford CRFM GPT-2 Small (117M, 5 random seeds), SmolLM3-3B (HuggingFaceTB, LLaMA-3, multilingual), and Qwen2.5-1.5B (Alibaba, 18T tokens). For each checkpoint, we measure both attention binding strength and behavioral performance on accessibility knowledge tasks (multiple-choice recognition and open-ended generation). This longitudinal design enables us to track the co-evolution of mechanistic structure and behavioral capability.

Our contributions are organized around four empirical claims (C1, C3–C5); a fifth claim concerning representational stability to prompt perturbations (C2) remains for future work (see §5.4). We validate these claims on two complementary datasets: (1) an expanded **9-term pilot set** for high-variance demonstration, and (2) a **41-term canonical register** for term-agnostic, cross-architecture validation (N=205 recognition prompts spanning seven model-checkpoint pairs, five architectures, and up to 5 random seeds for CRFM):

1. **Coupling-decoupling lifecycle (C1).** Binding-behavior correlation undergoes a phase transition during training. In the 9-term pilot, strong positive coupling at early checkpoints (ρ = +0.57, p < 0.001, steps 15–30K) gives way to negative correlation at late checkpoints (ρ = −0.20, p < 0.01, steps 120–143K). The **41-term canonical analysis** (within-term temporal precedence, C1-B) finds that EB\* leads behavior in 73–90% of terms across Pythia-1B/2.8B, OLMo-1B (90%, p < 0.0001), and CRFM (73% mean across 5 seeds, p ≪ 0.001). Temporal precedence therefore appears to be a general property at ≥1B scale and also holds for CRFM GPT-2 Small, whereas Pythia-160M shows anti-precedence, which we attribute to insufficient training duration (§4.2).

2. **Unlockable latent knowledge (C3).** Models with high binding but low baseline performance contain latent knowledge that few-shot prompting can unlock. A pilot result shows a +61 pp improvement (3-term, narrow rubric); the **41-term canonical protocol** confirms gains of +18–37 pp across all seven models, with the late Pythia-1B checkpoint showing the largest gain (+37.0 pp). More recent models (SmolLM3, Qwen) reach the same absolute few-shot ceiling (~0.72) but a lower nominal Δ because their zero-shot baselines are higher, which indicates headroom compression rather than weak coupling (§4.3).

3. **Scale-dependent decoupling (C4).** The coupling-decoupling transition is governed by a two-factor model: (1) a **parameter threshold** (~1B) governs decoupling depth, and (2) a **training-step threshold** (~300K steps) governs temporal ordering. In the 9-term pilot, the 160M model maintains positive correlation (checkpoint-level ρ = +0.93), while the 1B and 2.8B models show systematic decoupling (ρ = −0.31 and −0.28, respectively). The **41-term canonical analysis** shows that SmolLM3-3B (3.44M steps) achieves the deepest decoupling in that dataset (ρ\_late = −0.281, 55% strict), while OLMo-1B (ρ\_late = −0.181, 44%) and Pythia-1B (ρ\_late = −0.054, 54%) show graded transitions. This pattern is consistent with longer training and larger scale accelerating decoupling (§4.4).

4. **Cross-scale and cross-architecture causal regimes (C5).** Targeted ablation of high-binding heads on the **41-term canonical dataset (N=205)** reveals a scale-graded trajectory across seven models: (1) *coupled* — binding heads necessary (Pythia-160M: −11.2 pp, spec=+0.137); (2) *load-bearing* — peak causal necessity at transitional scale (Pythia-1B: −15.1 pp, spec=+0.117); (3) *redundant/ceiling* — distributed representations with near-zero effect (OLMo, Qwen: ≈−1 pp, spec≈0); (4) *initialization-sensitive boundary* — CRFM GPT-2 at 117M shows seed-dependent outcomes (4/5 seeds coupled, 1/5 suppressor), demonstrating that small-model causal head function is not deterministically fixed by scale alone (§4.5.5).
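The within-term precedence analysis behind C1 can be illustrated as a one-sided sign test: count the terms whose binding signal emerges before behavioral competence, then ask how likely that count would be if the ordering were a coin flip. The sketch below is a minimal illustration under stated assumptions, not the analysis code used in §4.2. The emergence-step inputs, the tie-handling rule, both function names, and the numbers are hypothetical and synthetic.

```python
from math import comb

def count_leads(binding_steps, behavior_steps):
    """Count terms whose binding emerges strictly before behavior.

    Each list holds one emergence step per term (same term order).
    Ties are dropped, as in a standard sign test.
    Returns (n_lead, n_total) over the non-tied terms.
    """
    if len(binding_steps) != len(behavior_steps):
        raise ValueError("inputs must have one entry per term")
    pairs = [(b, h) for b, h in zip(binding_steps, behavior_steps) if b != h]
    n_lead = sum(1 for b, h in pairs if b < h)
    return n_lead, len(pairs)

def sign_test_p(n_lead, n_total):
    """One-sided p-value: P(X >= n_lead) for X ~ Binomial(n_total, 0.5)."""
    if not 0 <= n_lead <= n_total:
        raise ValueError("n_lead must lie in [0, n_total]")
    tail = sum(comb(n_total, k) for k in range(n_lead, n_total + 1))
    return tail / 2 ** n_total

# Synthetic emergence steps for five hypothetical terms (not results from the paper).
binding = [1000, 2000, 4000, 8000, 1000]
behavior = [4000, 8000, 2000, 16000, 1000]

n_lead, n_total = count_leads(binding, behavior)
p_value = sign_test_p(n_lead, n_total)
print(n_lead, n_total, p_value)
```

With only four non-tied terms the test has little power, which is one reason a larger term register is useful for this kind of analysis.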

These findings establish attention binding as a diagnostic tool for early-stage concept emergence and reveal a *representational lifecycle*: models initially rely on explicit token-pair binding (coupling phase), then reorganize toward distributed mechanisms that decouple binding from behavior (decoupling phase). This lifecycle generalizes across five architectures (GPT-2, GPT-NeoX, OLMo, LLaMA-3 GQA, Qwen2 GQA) and is accelerated at larger scales and longer training, suggesting that binding heads become causally vestigial as models mature, a hypothesis supported by our seven-model ablation experiments (C5).

We further validate the EB\* metric through discriminant validity analysis: real accessibility terms (mean EB\* = 0.74) show significantly stronger binding than carefully designed controls including rare token pairs (0.50), cross-language mixing (0.41), and true nonsense (0.26), all p < 0.001. Term-level heterogeneity reveals that EB\* captures a specific attention mechanism distinct from general semantic knowledge—some terms maintain strong coupling (e.g., "color contrast", ρ = +0.68), while others use distributed representations (e.g., "aria attribute", ρ = +0.07 despite high behavioral competence).

The remainder of this paper is structured as follows. §2 reviews related work in mechanistic interpretability, concept emergence, and accessibility in NLP. §3 describes our metrics, models, and experimental design. §4 presents results for each claim. §5 discusses implications, limitations, and future directions. §6 concludes.
