# 2. Related Work

## 2.0 Positioning: Why Mechanistic Analysis of Multi-Token Concepts?

**Existing approaches and their limitations.** Three primary approaches exist for studying concept knowledge in language models, each with distinct limitations for multi-token technical terms:

1. **Behavioral probing** (Petroni et al., 2019; Meng et al., 2022) measures what models know through task performance (e.g., question answering, classification). Although effective for detecting whether knowledge is present, behavioral probes provide no insight into *when* knowledge forms during training (Olsson et al., 2022), *how* it is mechanistically represented, or *why* models with similar behavioral scores may differ in robustness. Recent work on continual pretraining (Xu et al., 2025) shows that behavioral performance can mask complex internal reorganization during training. Our discriminant validity experiments (§4.0) show that behavioral competence can coexist with low binding (e.g., "aria attribute": 76% behavioral accuracy, 42% EB\*), indicating that behavioral probes conflate multiple representational strategies.

2. **Token co-occurrence metrics** (e.g., pointwise mutual information, n-gram frequency) measure statistical association in training data. Our control experiments indicate that such statistics cannot separate meaningful conceptual binding from arbitrary token adjacency: initial controls using plausible bigrams such as "keyboard mouse" and "open source" yielded EB\* = 0.72–0.82, statistically indistinguishable from real accessibility terms (p > 0.05). Only genuinely nonsensical controls ("zqx plarf", "écran reader") established discriminant validity (§4.0), which indicates that EB\* captures representational structure beyond corpus statistics.

3. **Single-token concept analysis** (Burns et al., 2022; Cunningham et al., 2023) examines how models represent individual concepts through probing and sparse autoencoders. Multi-token technical terms present a fundamentally different challenge: the model must learn to bind constituents ("screen" + "reader") into a coherent unit distinct from other valid compositions ("screen door", "PDF reader"). Recent work shows that LLMs perform detokenization to reconstruct multi-token words (Kaplan et al., 2025), but this process occurs at early layers and does not explain how conceptual meaning emerges from composition. Attention binding provides a mechanistic signature of this compositional process.
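
To make the co-occurrence baseline in item 2 concrete, the following is a minimal sketch of bigram pointwise mutual information, `PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))`, estimated from raw counts. The function name `bigram_pmi` and the toy corpus are illustrative assumptions and are not taken from the experimental pipeline; a realistic estimate would use a large corpus and some form of smoothing.

```python
import math
from collections import Counter


def bigram_pmi(tokens, first, second):
    """Estimate PMI(first, second) in bits from unsmoothed corpus counts.

    Hypothetical helper for illustration only.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    joint = bigrams[(first, second)] / n_bigrams
    if joint == 0:
        # The pair never occurs adjacently: PMI is negative infinity.
        return float("-inf")

    p_first = unigrams[first] / n_unigrams
    p_second = unigrams[second] / n_unigrams
    return math.log2(joint / (p_first * p_second))


# Toy corpus (assumption): one conceptual bigram plus ordinary adjacency.
corpus = "the screen reader reads the screen door sign".split()
pmi_concept = bigram_pmi(corpus, "screen", "reader")
pmi_adjacent = bigram_pmi(corpus, "the", "screen")
```

In this toy corpus the function-word bigram "the screen" receives the same score as "screen reader", which illustrates that PMI rewards adjacency in the data regardless of whether the pair forms a concept.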

**What our approach enables.** By tracking attention binding longitudinally across training, we can: (a) detect concept acquisition *before* behavioral competence emerges (early warning signal), (b) identify representational reorganization invisible to behavioral probes (coupling→decoupling transition), and (c) causally validate the functional role of binding through ablation, revealing opposite effects at different scales. This mechanistic perspective complements behavioral evaluation by explaining *how* knowledge is organized internally, not just whether it exists.

## 2.1 Mechanistic Interpretability

Mechanistic interpretability seeks to reverse-engineer the computational structure of neural networks into human-understandable components (Olah et al., 2020; Elhage et al., 2021). Within transformer language models, attention heads have been identified as key functional units: induction heads support in-context learning (Olsson et al., 2022), while specialized heads perform syntactic operations such as subject-verb agreement (Clark et al., 2019; Voita et al., 2019). Our work extends this line of research by identifying attention heads that bind multi-token concepts and by using binding strength as a developmental marker rather than a static feature.

**Attention as compositional binding.** Recent theoretical work interprets self-attention as implementing vector-symbolic binding operations (Dhayalkar, 2025), where queries and keys define role spaces, values encode fillers, and attention weights perform soft unbinding. While this perspective provides a principled algebraic framework for understanding transformer reasoning, our work offers complementary empirical validation: we track how binding structure develops during training and correlates with behavioral competence, revealing developmental dynamics not captured by static architectural interpretations.

**Sparse autoencoders and monosemanticity.** An alternative approach to interpretability uses sparse autoencoders (SAEs) to decompose neural activations into interpretable features (Cunningham et al., 2023; Templeton et al., 2024). SAEs excel at discovering monosemantic features—individual directions corresponding to specific concepts. Our approach differs in focus: rather than decomposing activations into atomic features, we track compositional binding of multi-token concepts through attention patterns. These approaches are complementary—SAEs identify what features exist, while attention binding reveals how multi-token concepts are compositionally organized and how this organization evolves during training.

**Attention entropy as a measurement tool.** Prior work has used attention entropy to characterize attention patterns as focused versus diffuse (Clark et al., 2019). High entropy indicates that attention is spread nearly uniformly across tokens, while low entropy indicates concentration on a few positions. Zhang et al. (2025) demonstrate that in parallel context encoding settings, irregularly high attention entropy correlates strongly with performance degradation (Pearson r ≈ 0.95), with elevated entropy signaling representational confusion that impairs information retrieval. Our analysis (§4.1.2) builds on this foundation, showing that binding measurement requires low-entropy (focused) attention: when attention becomes uniformly diffuse, EB\* correctly reports the absence of binding structure rather than indicating a measurement failure.
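
As a concrete reference for the quantity discussed above, here is a minimal NumPy sketch of per-query attention entropy. The function names `attention_entropy` and `normalized_entropy` are hypothetical, and the normalization assumes every query can attend to all keys; under a causal mask, position *i* sees only *i* + 1 keys, so the maximum entropy would differ per row.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each query's attention distribution.

    attn: array of shape (..., num_queries, num_keys) whose rows sum to 1.
    """
    attn = np.asarray(attn, dtype=float)
    # Clip before the log so zero weights contribute exactly 0 to the sum.
    log_attn = np.log(np.clip(attn, eps, 1.0))
    return -(attn * log_attn).sum(axis=-1)


def normalized_entropy(attn):
    """Entropy divided by its maximum, log(num_keys): 0 is focused, 1 is uniform.

    Assumes no masking, i.e. all keys are visible to every query.
    """
    attn = np.asarray(attn, dtype=float)
    return attention_entropy(attn) / np.log(attn.shape[-1])


# Two illustrative rows: fully diffuse versus fully focused attention.
diffuse = np.full((1, 4), 0.25)
focused = np.array([[1.0, 0.0, 0.0, 0.0]])
diffuse_score = normalized_entropy(diffuse)
focused_score = normalized_entropy(focused)
```

The diffuse row reaches the maximum normalized entropy and the one-hot row the minimum, which matches the focused-versus-diffuse reading used in the text.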

## 2.2 Concept Emergence During Training

The study of how knowledge emerges during training has gained traction through training dynamics analyses. Pythia (Biderman et al., 2023) provides a controlled suite of models with public intermediate checkpoints, enabling longitudinal study. Prior work has examined the emergence of factual knowledge (Swayamdipta et al., 2020), syntactic competence (Choshen et al., 2022), and reasoning abilities (Wei et al., 2022) during training. Our contribution is to track a *mechanistic* signal—attention binding—alongside behavioral competence, revealing that internal structure can precede, decouple from, or even antagonize external capability depending on model scale.

**Grokking and phase transitions.** The coupling→decoupling transition we observe shares conceptual similarities with "grokking" (Power et al., 2022), in which generalization emerges suddenly after prolonged memorization on algorithmic tasks. However, grokking describes a behavioral transition (from memorization to generalization), whereas we observe mechanistic reorganization (from binding-dependent to distributed representations) that can occur independently of behavioral performance. Our finding that binding can decouple from behavior at larger scales suggests that these are distinct phenomena: grokking reflects a behavioral phase transition, while the coupling→decoupling transition reflects representational reorganization that may enable, coincide with, or follow behavioral improvements depending on model capacity.

## 2.3 Attention Head Ablation and Causal Analysis

Head ablation (zeroing or mean-ablating attention outputs) is a standard technique for assessing the causal importance of individual heads (Voita et al., 2019; Michel et al., 2019). Recent work has refined this approach through activation patching (Wang et al., 2023), path patching (Goldowsky-Dill et al., 2023), and learned causal gating (Nam et al., 2025). A recent survey (Kadem & Zheng, 2026) traces the evolution from visualization to intervention-based causal interpretability, highlighting trade-offs between intervention granularity and computational cost.

We adopt simple zero-ablation of attention patterns rather than these more sophisticated techniques for three reasons: (1) computational efficiency—activation patching requires multiple forward passes, (2) interpretability—zero-ablation has clear causal semantics (complete removal of a component), and (3) sufficiency—our discriminant validity pattern (only top-binding heads produce effects; random and bottom heads show no effect) suggests zero-ablation is adequate for testing our hypotheses about binding head necessity. Future work could use attribution patching or causal scrubbing for finer-grained mechanistic analysis (§5.5).
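
The zero-ablation idea can be sketched on a toy attention layer as follows. This is a simplified illustration under stated assumptions, not the experimental code: the function `multi_head_attention`, its weight shapes, and the `ablate_heads` parameter are hypothetical, there is no causal mask, and the layer has no bias terms (so zeroing a head's attention pattern and zeroing its output are equivalent here).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, ablate_heads=()):
    """Toy multi-head self-attention with optional zero-ablation of heads.

    x: (seq, d_model); w_q, w_k, w_v: (heads, d_model, d_head);
    w_o: (heads, d_head, d_model). All names and shapes are assumptions.
    """
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)

    scores = np.einsum("hqe,hke->hqk", q, k) / np.sqrt(q.shape[-1])
    pattern = softmax(scores, axis=-1)  # (heads, query, key)
    head_out = np.einsum("hqk,hke->hqe", pattern, v)

    # Zero-ablation: remove the selected heads' contribution entirely.
    keep = np.ones(head_out.shape[0])
    for h in ablate_heads:
        keep[h] = 0.0
    head_out = head_out * keep[:, None, None]

    # Sum each head's projection into the residual-stream dimension.
    return np.einsum("hqe,hed->qd", head_out, w_o)


rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 5, 8, 2, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(heads, d_head, d_model))

baseline = multi_head_attention(x, w_q, w_k, w_v, w_o)
without_head0 = multi_head_attention(x, w_q, w_k, w_v, w_o, ablate_heads=[0])
effect_of_head0 = baseline - without_head0  # head 0's direct contribution
```

Because head outputs are summed, each ablation in this sketch needs only one additional forward pass, and the difference from the baseline isolates the direct contribution of the removed head. In a full model, downstream layers would also react to the change.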

## 2.4 Accessibility in NLP

Web accessibility standards (WCAG; W3C, 2018) define requirements for making digital content usable by people with disabilities. While NLP systems are increasingly used to generate web content, accessibility-aware evaluation of language models remains limited. Prior work has examined bias in assistive technology descriptions (Trewin et al., 2019) and accessibility of AI-generated content (Gleason et al., 2020). Our work is, to our knowledge, the first to use accessibility concepts as a domain for studying mechanistic concept emergence in language models—chosen because these terms are multi-token, domain-specific, and have clear ground-truth evaluations.
