Attention-Head Binding as a Term-Conditioned Mechanistic Marker of Accessibility Concept Emergence in Language Models
Abstract: Assessing when language models develop specific capabilities remains challenging: behavioral evaluations are expensive and internal representations are opaque. We introduce attention-head binding ($EB^*$), a lightweight mechanistic metric that tracks how attention heads bind multi-token technical terms, such as accessibility concepts (“screen reader,” “alt text”), into coherent units during training. Using Pythia models (160M, 1B, 2.8B) across eight checkpoints, we report four empirical findings (C1, C3, C4, C5). At 160M and 2.8B, binding precedes behavioral competence (Spearman $r = 0.33$–$0.34$, $p < 0.001$), serving as an early-warning signal (C1). At 1B, we observe a decoupling effect: binding saturates early while behavior continues to improve, revealing divergent developmental trajectories (C4). High-binding/mid-accuracy checkpoints contain unlockable latent knowledge: few-shot prompting yields up to $+61$ percentage points (pp) of improvement ($183\%$ relative gain) and near-ceiling generation scores ($94.4\%$) from low zero-shot baselines (C3). Causal ablation reveals opposite mechanistic regimes across scales: high-binding heads are necessary at 160M (ablating them costs $16.7$ pp of recognition accuracy) but functionally superseded at 2.8B (ablating them improves recognition by $+33.3$ pp), providing direct evidence for the decoupling phenomenon (C5). These findings not only establish attention binding as a diagnostic for concept emergence but also demonstrate that the relationship between mechanistic structure and behavioral competence undergoes qualitative transformation across model scales, a phenomenon we term the binding–behavior decoupling effect. Code: available in the supplementary material.
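The abstract does not define how $EB^*$ is computed, so the following is a minimal illustrative sketch only, assuming a simple span-binding score: the fraction of attention mass that tokens inside a multi-token term span direct at other tokens of the same span, per head. The function and variable names (`binding_score`, `term_span`) and the max-over-heads aggregate are hypothetical, not the authors' method.

```python
# Illustrative sketch only: EB* is not defined in this excerpt. Here we
# ASSUME a span-binding score: the fraction of attention mass that tokens
# inside a multi-token term span place on other tokens of the same span.
import numpy as np

def binding_score(attn: np.ndarray, term_span: tuple) -> np.ndarray:
    """Per-head binding for one term occurrence.

    attn:      [n_heads, seq_len, seq_len] attention weights for one layer
               (each query row sums to 1).
    term_span: (start, end) token indices of the multi-token term,
               end-exclusive, e.g. the tokens of "screen reader".
    Returns:   [n_heads] fraction of each head's attention mass that
               term tokens keep inside the term span.
    """
    s, e = term_span
    within = attn[:, s:e, s:e].sum(axis=(1, 2))  # mass kept inside the span
    total = attn[:, s:e, :].sum(axis=(1, 2))     # all mass from span queries
    return within / np.maximum(total, 1e-9)

# Toy usage: 4 heads, 8 tokens, the term occupies positions 3-4.
rng = np.random.default_rng(0)
attn = rng.random((4, 8, 8))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize rows, as softmax would
scores = binding_score(attn, (3, 5))
print("per-head binding:", scores.round(3))
print("max-head binding (one plausible aggregate):", scores.max().round(3))
```

In practice such a score would be averaged over many term occurrences and tracked per checkpoint; taking the maximum over heads is just one plausible way to identify "high-binding" heads for the ablation analysis the abstract describes.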
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Xingchen_Wan1
Submission Number: 7505