Keywords: mechanistic interpretability, large language models, domain control, layer analysis, activation patching
TL;DR: In LLMs, attention layers act as domain "routers," while MLP layers store and carry out the domain-specific "computation" at a high level of domain abstraction
Abstract: Large language models (LLMs) perform well across diverse domains such as programming, medicine, and law, yet it remains unclear how domain information is represented and distributed within their internal mechanisms. A key open question is the division of labor between the Transformer's core components: self-attention and MLP layers. We address this question through a mechanistic study that dissects their roles by integrating three complementary analyses: representation separability via probes, parameter change under adaptation, and causal effects from activation swaps. Across six domains and multiple models, we find that both attention and MLP layers encode domain information, but in systematically different ways: attention layers concentrate domain information in localized "hotspots" (high variance across depth), while MLP layers distribute it uniformly. During fine-tuning, MLPs absorb 2-3× larger parameter updates, yet causal interventions reveal that specific mid-depth attention layers (e.g., layers 13-15) directionally steer domain predictions, whereas MLP interventions disrupt computation without directional control. These three lenses jointly support a coherent functional picture: MLP layers serve as the primary workbenches for domain-specific computation, while a small subset of attention layers act as high-gain steering points that route domain identity. Finally, we present a proof-of-concept parameter-efficient adaptation setup in which tuning only the layers highlighted by our analysis matches full-model fine-tuning on domain benchmarks, illustrating the practical potential of mechanistically informed PEFT.
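The "causal effects from activation swaps" analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy scalar "model," its layers, and the patch location are hypothetical stand-ins used only to show the mechanics of caching an activation from a source-domain run and swapping it into a target-domain run.

```python
# Minimal sketch of activation patching (a causal activation swap).
# The toy model and layer functions below are illustrative assumptions,
# not the paper's actual architecture or domains.

def run_model(layers, x, patch=None):
    """Run a stack of layer functions on input x.

    If patch=(i, value) is given, overwrite layer i's activation with
    the cached value before it flows into the next layer.
    Returns (final_output, list_of_activations).
    """
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and i == patch[0]:
            x = patch[1]  # swap in the cached activation
        acts.append(x)
    return x, acts

# Toy 3-layer "model": each layer is a simple scalar function.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# 1) Clean run on a "source domain" input: cache its activations.
_, src_acts = run_model(layers, 10)

# 2) Baseline and patched runs on a "target domain" input: overwrite
#    layer 1's activation with the source run's cached value, then
#    measure how the output shifts downstream of the intervention.
clean_out, _ = run_model(layers, 0)
patched_out, _ = run_model(layers, 0, patch=(1, src_acts[1]))
print(clean_out, patched_out)  # prints: -1 19
```

The gap between `patched_out` and `clean_out` is the causal effect attributed to that layer's activation; in a real LLM the same swap is typically done per layer and per component (attention vs. MLP output) via forward hooks, which is how layer-level "steering" versus "disruption" effects like those described above can be distinguished.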
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 21377