Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Spotlight · CC BY 4.0
Keywords: Sparse Autoencoders, Chain of Thought/Reasoning models
TL;DR: We mechanistically show that finetuning-induced brittleness is not inevitable, as the feature repurposing essential for learning reasoning is causally separable from the unnecessary feature suppression that accompanies it.
Abstract: Supervised fine-tuning (SFT) on chain-of-thought data induces brittleness in language models, improving reasoning capabilities while severely degrading general performance. We provide the first mechanistic explanation for this trade-off through three complementary techniques: crosscoders for mapping feature transformations, Fisher Information-based identification of causal features, and gradient blocking for intervention experiments. Our analysis reveals that SFT operates through two distinct mechanisms—repurposing shared features for reasoning tasks and suppressing base-only features. Fisher Information with Sparse Autoencoders identifies the specific features responsible for reasoning, validated through feature steering that achieves 3.46% performance gains on base models. Crosscoder analysis demonstrates that SFT repurposes existing reasoning capabilities in the base model rather than creating new ones. Gradient blocking experiments prove these mechanisms are separable: blocking shared features eliminates reasoning entirely, while blocking base-only features preserves it, demonstrating that base feature suppression is unnecessary for reasoning. This mechanistic understanding provides the foundation for developing surgical training methods that preserve general capabilities while enhancing reasoning.
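To make the Fisher Information scoring described in the abstract concrete, here is a minimal sketch of the general idea: score each SAE latent feature by the expected squared gradient of a loss with respect to its activation, then take the top-scoring features as candidates for steering or blocking. All names, shapes, and the proxy loss below are hypothetical stand-ins, not the paper's actual models, data, or implementation; in the paper the loss would be the finetuned model's loss on reasoning traces.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (hypothetical shapes): cached residual-stream activations
# from some layer, plus the weights of a pretrained sparse autoencoder.
d_model, d_sae, n_tokens = 64, 256, 512
acts = torch.randn(n_tokens, d_model)
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5

def sae_encode(x):
    # SAE latent feature activations (one column per feature).
    return torch.relu(x @ W_enc)

# Diagonal Fisher Information w.r.t. feature activations:
# F_i ~ E[(dL/df_i)^2]. The reconstruction loss here is only a runnable
# proxy for illustration.
feats = sae_encode(acts).requires_grad_(True)
recon = feats @ W_dec
loss = ((recon - acts) ** 2).mean()
(grad,) = torch.autograd.grad(loss, feats)
fisher = (grad ** 2).mean(dim=0)  # one score per SAE feature

# High-Fisher features are the candidates for steering / blocking.
top_features = torch.topk(fisher, k=10).indices
print(top_features.tolist())
```

A gradient blocking intervention in the same spirit could be sketched by registering a backward hook on the feature activations (e.g., `tensor.register_hook`) that zeroes gradients at the selected indices during finetuning, so those features receive no updates; the paper's exact blocking procedure may differ.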
Submission Number: 268