Different Heads, Same Fragile Tasks: A Cross-model Retrieval Head Correspondence

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Virtual poster | CC BY 4.0
Keywords: Methods (probing, steering, causal interventions), Benchmarking Interpretability
Other Keywords: mechanistic interpretability, retrieval heads, attention heads, long-context, activation patching, head ablation, cross-model analysis, factual retrieval
TL;DR: We show that QRScore-selected retrieval heads causally mediate SEC long-context fact extraction, are shared across related tasks within a model, and vary across models while preserving which tasks are most fragile.
Abstract: Long-context language models rely on a sparse subset of attention heads for factual retrieval. It remains unclear whether these retrieval heads form task-specific mechanisms, a shared head pool, or model-specific artifacts. We test this on eight fact-extraction tasks built from SEC 10-K filings, run on three open-weight 7--8B instruction-tuned models. For each task and model, we rank query-focused retrieval (QR) heads and ablate the top-$K$ heads of one source task while measuring accuracy on every target task. Ablations transfer broadly across tasks within each model. Across models, source-task head groups are not consistently destructive: after controlling for ablation size, source disruptivity correlates only weakly across model pairs ($R^2 = 0.02$--$0.15$). The target tasks that collapse under ablation are partially shared across models ($R^2 = 0.18$--$0.59$). Activation patching shows that QR-head activations causally carry answer information: clean QR-head outputs recover a substantial-to-near-full fraction of the log-probability lost to ablation, while bottom-ranked, random, and same-layer non-QR controls do not. Different models implement entity extraction with different head populations, but they agree on which tasks are vulnerable to retrieval-head removal.
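The quantities named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' code: the function names are hypothetical, the matrix is assumed to hold accuracy drops with rows as source tasks and columns as target tasks, $R^2$ is taken to be the squared Pearson correlation, and the recovery fraction is assumed to be the share of the ablation-induced log-probability loss that patching restores. The abstract's control for ablation size is not modeled.

```python
def recovery_fraction(lp_clean, lp_ablated, lp_patched):
    """Share of the log-probability lost to ablation that patching restores.

    Assumed definition: (patched - ablated) / (clean - ablated).
    """
    lost = lp_clean - lp_ablated
    if lost <= 0:
        raise ValueError("ablation did not reduce the log-probability")
    return (lp_patched - lp_ablated) / lost


def source_disruptivity(acc_drop):
    """Mean accuracy drop caused by each source task's heads (row means)."""
    return [sum(row) / len(row) for row in acc_drop]


def target_fragility(acc_drop):
    """Mean accuracy drop suffered by each target task (column means)."""
    n_sources = len(acc_drop)
    n_targets = len(acc_drop[0])
    return [sum(row[j] for row in acc_drop) / n_sources for j in range(n_targets)]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Made-up toy matrices for two hypothetical models (not results from the paper).
model_a = [[0.10, 0.30], [0.50, 0.70]]
model_b = [[0.40, 0.60], [0.20, 0.80]]

# Cross-model agreement on which target tasks are fragile.
fragility_r2 = r_squared(target_fragility(model_a), target_fragility(model_b))
```

Under these assumptions, a low `r_squared` over `source_disruptivity` alongside a higher one over `target_fragility` would correspond to the pattern the abstract reports: models disagree on which head groups are destructive but partly agree on which tasks break.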
Submission Number: 664