Gluing Local Contexts into Global Meaning: A Sheaf-Theoretic Decomposition of Transformer Representations
Keywords: sheaf cohomology, transformer interpretability, representation decomposition, activation steering, mechanistic interpretability, spectral methods, paraphrase invariance
TL;DR: We decompose transformer representations into content-stable (H⁰) and context-dependent (H¹) subspaces via sheaf Laplacian spectral analysis
Abstract: We decompose transformer activations into content-stable ($H^0$) and context-dependent ($H^1$) subspaces using sheaf cohomology. A cellular sheaf built over paraphrase graphs yields a Laplacian whose spectral structure separates phrasing-invariant directions from maximally varying ones, requiring no concept labels or supervised training. Across five models (124M--13B parameters), $H^1$ dimensions exert $3.5$--$26.5\times$ greater causal influence on model output than variance-matched controls (Cohen's $d = 2.3$--$14.3$), $H^0$ retrieves facts at 60--68\% accuracy using only 20 dimensions, and the two subspaces produce opposite effects under ablation. The decomposition also reveals architecture-dependent fragility: Llama-2-7B collapses under random perturbation (4.2\% fact preservation) while all directed methods preserve facts at 12--14\% ($p < 10^{-10}$, $n = 1000$); with architecture-specific restriction maps this gap widens to 31.0\% vs.\ 4.2\% ($p < 10^{-50}$). Robust models tolerate both perturbation types. Project page: https://cwru-aism.github.io/gluing-lc-page/
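The core construction in the abstract — a cellular sheaf over a paraphrase graph whose Laplacian spectrum separates phrasing-invariant from context-varying directions — can be sketched on a toy example. This is a minimal illustration, not the paper's implementation: the graph, stalk dimension, and identity restriction maps are assumptions (the paper learns architecture-specific restriction maps, and stalks would be model hidden states).

```python
import numpy as np

# Toy paraphrase graph: 3 nodes (paraphrases of one sentence), fully connected.
edges = [(0, 1), (1, 2), (0, 2)]
n, d = 3, 4  # number of nodes, stalk dimension (hypothetical; real stalks are hidden dims)

# Restriction maps F[e][v]: stalk(v) -> edge stalk. Identity maps stand in for
# the learned, architecture-specific maps described in the abstract (assumption).
F = {e: {v: np.eye(d) for v in e} for e in edges}

# Coboundary delta: C^0 (size n*d) -> C^1 (size |E|*d),
# with (delta x)_e = F_{u <| e} x_u - F_{v <| e} x_v for each oriented edge e = (u, v).
delta = np.zeros((len(edges) * d, n * d))
for i, (u, v) in enumerate(edges):
    delta[i * d:(i + 1) * d, u * d:(u + 1) * d] = F[(u, v)][u]
    delta[i * d:(i + 1) * d, v * d:(v + 1) * d] = -F[(u, v)][v]

L = delta.T @ delta  # sheaf Laplacian
evals, evecs = np.linalg.eigh(L)  # eigenvalues in ascending order

# Kernel of L = global sections: directions that agree across all paraphrases
# (the H^0-like, phrasing-invariant subspace). The top of the spectrum gives
# the maximally context-varying directions (H^1-like, in the paper's usage).
H0 = evecs[:, np.isclose(evals, 0.0, atol=1e-8)]
H1_like = evecs[:, -d:]
print("H0 dimension:", H0.shape[1])  # = d on a connected graph with identity maps
```

With identity restriction maps this reduces to the ordinary graph Laplacian tensored with the identity, so the kernel is exactly the $d$-dimensional space of constant sections; nontrivial restriction maps shrink or twist this kernel, which is what makes the decomposition informative on real activations.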
Submission Number: 43