Attention-Head Binding as a Term-Conditioned Mechanistic Marker of Accessibility Concept Emergence in Language Models
Abstract: Assessing when language models develop specific capabilities remains challenging: behavioral evaluations are expensive and internal representations are opaque. We introduce attention-head binding ($EB^*$), a lightweight mechanistic metric that tracks how attention heads bind multi-token technical terms, such as accessibility concepts ("screen reader," "alt text"), into coherent units during training. Using seven models spanning five architectures (Pythia 160M/1B/2.8B, OLMo-1B, CRFM GPT-2 Small with 5 seeds, SmolLM3-3B, and Qwen2.5-1.5B), we evaluate on 41 canonical accessibility terms ($N=205$ prompts) plus the 9-term pilot set, and report five empirical findings. Discriminant-validity tests distinguish $EB^*$ from token co-occurrence baselines (nonsense terms $0.26 \to$ real terms $0.74$, all $p<0.001$, $d=1.2$--$2.9$). The relationship between binding and behavior shifts markedly over the course of training: early in training, the two are tightly coupled ($\rho=+0.57$, $p<0.001$); later, this pattern reverses into a decoupled regime ($\rho=-0.20$, $p=0.01$). Cross-architecture replication confirms C1-B: OLMo-1B achieves 90% $EB^*$-leads ($p<0.0001$) and CRFM 72.7% ($p \ll 0.001$). These results suggest a two-factor model: a parameter threshold around 1B parameters that controls how deeply decoupling occurs, and a training-step threshold near 300K steps that determines when the temporal ordering between binding and behavior emerges (C1/C4). High-binding/mid-accuracy checkpoints contain unlockable latent knowledge, yielding few-shot gains of up to 61 percentage points (a 183% relative improvement), replicated at 18--37 points across six of seven models (CRFM shows weak unlockability at +7.6 pp due to undertraining). Modern models such as SmolLM3 and Qwen exhibit headroom compression: they reach the same absolute ceiling near 0.72 but show smaller nominal gains because their zero-shot baselines are already high (C3).
Causal ablation reveals opposite regimes across scales. At 160M, binding heads remain necessary for performance: removing them reduces accuracy by 16.7 percentage points. At 2.8B, the same heads have become functionally superseded: ablating them improves performance by 33.3 points. Cross-architecture analysis (C5) reveals three distinct patterns. First, OLMo and Qwen sit at a near-perfect recognition ceiling with negligible ablation effects. Second, SmolLM3 operates in a distributed regime with negative specificity ($-0.043$). Third, CRFM displays striking initialization sensitivity: four of five random seeds show coupled behavior, while one seed exhibits suppressor dynamics. These findings not only establish attention binding as a diagnostic for concept emergence but also demonstrate that mechanistic structure and behavioral competence undergo qualitative transformation across model scales, a phenomenon we term the "binding-behavior decoupling effect".
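The submission does not reproduce the exact $EB^*$ formula in this summary, but a minimal proxy for head-level binding can be sketched: the attention mass that a multi-token term's final token places on the term's earlier tokens, per head. This is an illustrative assumption, not the paper's implementation; the `binding_score` helper and the synthetic attention tensor below are hypothetical.

```python
import numpy as np

def binding_score(attn, term_span):
    """Per-head proxy for term binding: the fraction of attention mass
    that the last token of a multi-token term places on the term's
    earlier tokens.

    attn: (n_heads, seq_len, seq_len) attention weights (rows sum to 1).
    term_span: (start, end) token indices of the term, end exclusive.
    """
    start, end = term_span
    last = end - 1
    # attention from the term's final token back to its preceding tokens
    return attn[:, last, start:last].sum(axis=-1)

# toy example: 2 heads, 5 tokens; the term spans tokens 2..4
rng = np.random.default_rng(0)
attn = rng.random((2, 5, 5))
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalise like softmax output
scores = binding_score(attn, (2, 4))      # one score per head, each in [0, 1]
```

Heads whose score on real terms substantially exceeds their score on matched nonsense strings would, under this proxy, be candidates for the "binding heads" targeted by the ablations above.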
Code: available in the supplementary material.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: This revision expands from 9 to **41 canonical accessibility terms** (205 recognition prompts) and replicates all five empirical claims across **7 models / 5 architectures**.
---
### **Addressing R1 (scale skepticism) & R3 (cross-model replication)**
41-term register (WCAG 2.1 categories: Perceivable 12, Operable 14, Understandable 7, Robust 4, Cross-cutting 4), drawing from WCAG 2.1 (61 criteria), WAI-ARIA 1.2 (~200 terms), and the AccessEval benchmark (Panda et al., 2025). Prior mechanistic interpretability work on multi-token concepts has typically analyzed only a small number of such terms (e.g., Nanda et al., 2023; Lieberum et al., 2023), often on the order of roughly 10 or fewer explicitly studied cases; we exceed this by 2–4×.
Computational requirements (log-derived): ~2–5 GPU-min/term/checkpoint at 1B scale. Full reproducible pipeline: ~20–25 GPU-hours (wall-clock 3–7 days). Total project R&D (Feb pilot, Apr expansion + cross-architecture): ~40–60 GPU-hours.
7 models / 5 architectures: Pythia 160M/1B/2.8B (GPT-NeoX), OLMo-1B (Dolma), CRFM GPT-2 Small (5 seeds), SmolLM3-3B, Qwen2.5-1.5B. Qwen excluded from lifecycle analyses due to the lack of intermediate checkpoints; CRFM provides only 2 checkpoints. Lifecycle claims (C1/C4) cover 6 models; single-checkpoint claims (C3/C5) cover all 7.
**Key results (41-term)**:
- **C1-B**: OLMo-1B 90% EB$^*$-leads ($p<0.0001$), CRFM 72.7%, Pythia-1B 73.2%, Pythia-2.8B 79.4%. Pythia-160M's 7% reflects maintained coupling below the 1B threshold, not a replication failure.
- **C4-B**: Strict decoupling deepest at SmolLM3-3B (55%, $\rho_\text{late}=-0.281$), Pythia-1B 54%, Pythia-160M 46%.
- **C3**: Pythia-1B strongest (+37.0 pp). Modern models (SmolLM3 +18.0, Qwen +18.2) show **headroom compression** (ceiling ~0.72, high ZS baselines). CRFM outlier (+7.6 pp, undertrained).
- **C5**: OLMo/Qwen near-perfect ceiling (negligible ablation). SmolLM3 distributed (spec $-0.043$). CRFM initialization-sensitive (4/5 coupled, 1/5 suppressor).
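The coupled vs. decoupled regimes in C1-B/C4-B rest on rank correlations ($\rho$) between per-checkpoint binding and accuracy. A toy illustration with synthetic trajectories (not the paper's data; the rank-based `spearman` helper is our own sketch, assuming no ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman's rho via Pearson correlation of ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# early training: binding and accuracy rise together (coupled regime)
binding_early  = [0.10, 0.20, 0.40, 0.60]
accuracy_early = [0.05, 0.15, 0.30, 0.50]
# late training: binding keeps rising while accuracy drifts (decoupled regime)
binding_late  = [0.65, 0.70, 0.72, 0.74]
accuracy_late = [0.60, 0.58, 0.59, 0.55]

rho_early = spearman(binding_early, accuracy_early)  # +1.0: tightly coupled
rho_late  = spearman(binding_late, accuracy_late)    # negative: decoupled
```

The sign flip between `rho_early` and `rho_late` mirrors the reported shift from $\rho=+0.57$ early to $\rho=-0.20$ late.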
**Discriminant validity** (V2–V4): robust at 2.8B ($d=1.2$–$2.9$), partial at 1B, **fails at 160M**, a scale-dependent precision limit rather than a flaw.
---
### **Transparency corrections**
- Lifecycle claims: corrected from "7 models" to "6 with lifecycle data" (Qwen excluded).
- Pythia-160M 7%: framed as scale-dependent boundary condition, not replication failure.
- C3: corrected from "all seven" to "six of seven" (CRFM weak at +7.6 pp).
- Figure captions: trajectory figures (Fig 1, 2) disclosed as 3-term pilot data with references to 41-term tables.
- Scoring: manual rubric validation noted as pilot-era; 41-term primarily uses recognition accuracy.
---
### **New artifacts**
- Tables: C1-B temporal (6 models), C4-B decoupling (6), C3 unlockability (7), C5 causal (7)
- Figure: C1-B forest plot (Wilson 95% CIs)
- Appendix A.1j: 41-term × 6 model C4-B per-term breakdowns
- Appendix: complete prompt inventory (12 JSONL files)
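The C1-B forest plot reports Wilson 95% intervals on the per-model $EB^*$-lead proportions. For reference, the Wilson score interval is self-contained to compute; the counts below are made-up illustrations, not the paper's data.

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# hypothetical example: 36 EB*-leads out of 40 term comparisons (90%)
lo, hi = wilson_ci(36, 40)
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and remains sensible for proportions near the 90%+ lead rates reported for OLMo-1B.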
Assigned Action Editor: ~Xingchen_Wan1
Submission Number: 7505