Keywords: reasoning models, reinforcement learning, activation patching, interpretability, causal intervention
TL;DR: A minimal cross-domain activation transfer from a solved formal prompt into its narrative twin boosts accuracy (+20.5 pts) in an RLVR-trained model, revealing a potential task-state invocation bottleneck rather than missing competence.
Abstract: We investigate why reasoning improvements from reinforcement learning on chain-of-thought (RL-CoT) often fail to transfer across superficially different problem presentations. Using parallel datasets in which identical logical problems are expressed as formal statements versus natural language narratives (n=200 problem pairs), we find that DeepSeek-R1-Distill-Qwen3-8B solves formal variants reliably but fails on isomorphic narrative versions. Through causal intervention experiments, we show that this performance gap reflects failed invocation rather than missing competence. Patching MLP activations (layers 12-18) from the final token of successful formal-problem runs into failed narrative-problem runs yields a 20% absolute accuracy improvement (Cohen's d=0.57), the emergence of self-correction behaviors (increased frequency of tokens such as "wait" and "alternatively"), and longer but more productive chains-of-thought. Crucially, patching rescues problem-solving without introducing any new information, only activations from the same underlying problem in a different surface form. These results provide evidence that RL-CoT training produces reasoning computations that exist within the model but fail to activate consistently across problem framings. The narrow layer band (12-18) where patching succeeds, combined with the degenerate behaviors produced by patching earlier layers, suggests these computations occupy specific neural localities rather than being distributed throughout the network, and indicates that current RL methods produce reasoning capabilities keyed to surface features of the training distribution rather than to abstract problem structure.
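The intervention described in the abstract is cross-prompt activation patching of MLP outputs at the final prompt token. The sketch below is illustrative only: the model identifier, the hook targets (a Llama/Qwen-style `model.model.layers[i].mlp` layout), greedy decoding, and all helper names are assumptions for exposition, not the paper's implementation; only the layer band 12-18 and the final-token patch site come from the abstract.

```python
# Illustrative sketch of cross-prompt activation patching with PyTorch forward hooks.
# Assumptions (not from the paper): model name, module layout, and decoding settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your/reasoning-model"   # placeholder; substitute the model under study
PATCH_LAYERS = range(12, 19)          # layers 12-18, as described in the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mlp_module(layer_idx):
    # Assumes a Llama/Qwen-style decoder layout: model.model.layers[i].mlp
    return model.model.layers[layer_idx].mlp

@torch.no_grad()
def cache_final_token_mlp(formal_prompt):
    """Run the solved formal prompt once and cache each target layer's MLP output
    at the final prompt token."""
    cache, handles = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            cache[idx] = output[:, -1, :].detach().clone()
        return hook

    for i in PATCH_LAYERS:
        handles.append(mlp_module(i).register_forward_hook(make_hook(i)))
    ids = tokenizer(formal_prompt, return_tensors="pt").to(model.device)
    model(**ids)
    for h in handles:
        h.remove()
    return cache

@torch.no_grad()
def generate_with_patch(narrative_prompt, cache, max_new_tokens=512):
    """Generate on the narrative prompt while overwriting the final-prompt-token
    MLP activations in the patched layers with the cached formal-prompt activations."""
    ids = tokenizer(narrative_prompt, return_tensors="pt").to(model.device)
    prompt_len = ids["input_ids"].shape[1]
    handles = []

    def make_hook(idx):
        def hook(module, inputs, output):
            # Patch only on the prefill pass over the full prompt, not on
            # single-token decode steps (which have sequence length 1).
            if output.shape[1] >= prompt_len:
                output = output.clone()
                output[:, prompt_len - 1, :] = cache[idx].to(output.dtype)
                return output
        return hook

    for i in PATCH_LAYERS:
        handles.append(mlp_module(i).register_forward_hook(make_hook(i)))
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    for h in handles:
        h.remove()
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)

# Usage sketch: cache activations from the formal variant, then decode the narrative twin.
formal_prompt = "..."      # solved formal statement of the problem
narrative_prompt = "..."   # isomorphic natural-language narrative
answer = generate_with_patch(narrative_prompt, cache_final_token_mlp(formal_prompt))
```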
Submission Number: 208