Localizing Reasoning Training-Induced Changes in Large Language Models

Published: 30 Sept 2025, Last Modified: 30 Sept 2025, Mech Interp Workshop (NeurIPS 2025) Poster, CC BY 4.0
Open Source Links: https://github.com/mklabunde/localizing-reasoning
Keywords: Chain of Thought/Reasoning models
TL;DR: We compare reasoning models to their base models to identify which model components are most affected by reasoning training.
Abstract: Reasoning language models demonstrate excellent performance, but how reasoning training transforms models internally remains poorly understood. We systematically analyze where reasoning training induces changes by studying 19 publicly released reasoning models based on Qwen, trained with supervised fine-tuning (SFT) or reinforcement learning (RL). Using direct weight comparison and representation similarity via centered kernel alignment (CKA), we reveal a striking discrepancy: weight differences are broadly distributed across layers, while CKA highlights the middle layers as most strongly changed. We trace the CKA effect to "dominant datapoints", as proposed by Nguyen et al.: tokens with large activations that might affect attention mechanisms. Additionally, SFT induces substantially larger changes than RL. Our work takes a step toward a better understanding of reasoning training.
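For readers unfamiliar with the similarity measure used in the abstract, the following is a minimal sketch of linear CKA as defined by Kornblith et al. (2019). It is an illustration only: the paper may use a different CKA variant (e.g., kernel or minibatch CKA), and the activation-extraction step is merely stubbed out here; the variable names and the toy data are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Returns a value in [0, 1]; 1 means the two representations are identical
    up to an orthogonal transform and isotropic scaling.
    """
    # Center each feature dimension.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denominator = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return float(numerator / denominator)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for layer-l activations of a base model and its
    # reasoning-tuned variant on the same batch of inputs.
    base_acts = rng.standard_normal((256, 512))
    tuned_acts = base_acts + 0.1 * (base_acts @ rng.standard_normal((512, 512)))
    print(f"linear CKA: {linear_cka(base_acts, tuned_acts):.3f}")
```

Computing this score per layer, between a base model and its reasoning-trained counterpart, yields the layerwise similarity profile the abstract refers to.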
Submission Number: 190