Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Published: 11 Nov 2025 · Last Modified: 23 Dec 2025 · XAI4Science Workshop 2026 · CC BY 4.0
Track: Tiny Paper Track (Page limit: 3-5 pages)
Keywords: transformer interpretability, recall vs reasoning, activation patching, attention analysis, mechanistic interpretability, layer specialization
TL;DR: Causal evidence that transformer models use separable circuits for factual recall and reasoning, revealed through layer-wise activation and attention analysis.
Abstract: Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multistep inference), yet it remains unclear whether these functions rely on overlapping or separable internal circuits. Understanding these mechanisms is critical for building trustworthy and scientifically interpretable AI systems. We address this question through mechanistic interpretability, using controlled linguistic puzzles to probe transformer models at the layer, head, and neuron levels. By combining layer-wise activation tracing, attention-head specialization metrics, and causal activation patching, we identify subnetworks whose perturbation selectively disrupts either factual retrieval or reasoning. Across the Qwen, LLaMA-3, and Mistral families, we find a consistent pattern: early and middle layers primarily support recall, while deeper layers and specific MLP pathways enable reasoning. Interventions on these distinct components produce selective impairments: disabling the identified recall circuits reduces factual precision by up to 15% while leaving reasoning intact, whereas disabling reasoning circuits yields a comparable drop in multistep inference. These findings offer causal, interpretable evidence that recall and reasoning arise from partially distinct but complementary computational processes, advancing the mechanistic foundations of explainable AI.
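For readers unfamiliar with causal activation patching, the sketch below illustrates the general idea using plain PyTorch forward hooks on a small open model: cache a layer's activation from a "clean" prompt, splice it into a run on a "corrupted" prompt, and check how much the clean answer is restored. The model (gpt2), the layer index, the prompts, and the target token are placeholder assumptions for illustration only and do not reflect the paper's actual experimental setup.

```python
# Minimal activation-patching sketch (illustrative; not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies Qwen, LLaMA-3, and Mistral
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

clean = "The Eiffel Tower is located in the city of"     # placeholder prompt
corrupt = "The Colosseum is located in the city of"       # placeholder prompt
layer_idx = 6  # placeholder: which decoder block's output to patch

clean_ids = tok(clean, return_tensors="pt").input_ids
corrupt_ids = tok(corrupt, return_tensors="pt").input_ids
block = model.transformer.h[layer_idx]  # GPT-2 module path; differs per model family

# 1) Cache the chosen block's output on the clean prompt.
cache = {}

def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()  # GPT-2 blocks return a tuple; [0] is hidden states

handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(clean_ids)
handle.remove()

# 2) Re-run on the corrupted prompt, overwriting that block's output
#    at the final token position with the cached clean activation.
def patch_hook(module, inputs, output):
    patched = output[0].clone()
    patched[:, -1, :] = cache["h"][:, -1, :]
    return (patched,) + output[1:]

handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupt_ids).logits[0, -1]
handle.remove()

with torch.no_grad():
    corrupt_logits = model(corrupt_ids).logits[0, -1]

# 3) Compare the logit of the clean answer (" Paris") with and without the patch.
paris = tok(" Paris", add_special_tokens=False).input_ids[0]
print("logit(' Paris') corrupted:", corrupt_logits[paris].item())
print("logit(' Paris') patched:  ", patched_logits[paris].item())
```

If patching a given layer substantially restores the clean answer's logit, that layer is implicated in carrying the recalled fact; sweeping this procedure across layers and components is the kind of layer-wise causal evidence the abstract describes.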
Submission Number: 38