Localizing RL-Induced Tool Use to a Single Crosscoder Feature

Published: 11 Jun 2026, Last Modified: 25 Jun 2026, Venue: Mech Interp Workshop ICML 2026 Spotlight, Readers: Everyone, License: CC BY 4.0
Keywords: Methods (probing, steering, causal interventions), Concept Discovery (e.g., SAEs, dictionary learning), Applications of interpretability
Other Keywords: dedicated feature crosscoders, model diffing, crosscoders, pragmatic interp
TL;DR: The capability gap that RL opens between a base model and its tuned counterpart can be compressed into one feature direction, and that direction leaks back into the base model through shared decoder weights.
Abstract: Fine-tuning through RL reshapes the internal representations of language models to enable agentic behaviors such as tool use, yet the mechanistic basis of these changes remains poorly understood. Although RL substantially improves structured tool-call generation, it is unclear which features emerge, which are preserved, and whether the identified features can be used for retraining-free behavioral control. In this work, we show that $\textit{Dedicated Feature Crosscoders (DFC)}$ isolate a compact set of RL-specific features that mediate tool-calling capability in $\texttt{Qwen2.5-3B}$. Across a $48$-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by $+31.1 \pm 9.7$ pp and passively transfers tool-calling ability to the frozen base model ($+6.8 \pm 5.0$ pp), an effect we call $\textit{capability spillover}$. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.
Submission Number: 546
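
Illustrative sketch (not taken from the submission): the abstract's "encode-decode reconstruction" with partitioned features can be pictured roughly as below. Everything here is an assumption made for illustration: the three-way partition into base-only, shared, and RL-only latents, the ReLU encoder over concatenated activations, the masking scheme, and all function and parameter names are hypothetical, and the paper's actual DFC architecture may differ.

```python
import numpy as np

def make_partition_mask(n_base_only, n_shared, n_rl_only):
    """Return a (2, n_latents) 0/1 mask: row 0 = base decoder, row 1 = RL decoder.

    Assumed partition layout (hypothetical): [base-only | shared | RL-only].
    """
    n_latents = n_base_only + n_shared + n_rl_only
    mask = np.ones((2, n_latents))
    mask[1, :n_base_only] = 0.0              # base-only latents do not write to the RL model
    mask[0, n_base_only + n_shared:] = 0.0   # RL-only latents do not write to the base model
    return mask

def encode(x_base, x_rl, W_enc, b_enc):
    """Shared ReLU encoder over the concatenated activations of both models."""
    x = np.concatenate([x_base, x_rl], axis=-1)   # (batch, 2 * d_model)
    return np.maximum(x @ W_enc + b_enc, 0.0)     # (batch, n_latents)

def decode(z, W_dec, b_dec, mask):
    """Per-model reconstruction; masked latents contribute nothing to that model.

    W_dec: (2, n_latents, d_model), b_dec: (2, d_model), mask: (2, n_latents).
    Returns an array of shape (2, batch, d_model).
    """
    masked_W = W_dec * mask[:, :, None]
    return np.einsum("bn,mnd->mbd", z, masked_W) + b_dec[:, None, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_latents = 5, 8
    mask = make_partition_mask(n_base_only=2, n_shared=4, n_rl_only=2)
    W_enc = rng.normal(size=(2 * d_model, n_latents))
    W_dec = rng.normal(size=(2, n_latents, d_model))
    z = encode(rng.normal(size=(3, d_model)), rng.normal(size=(3, d_model)),
               W_enc, np.zeros(n_latents))
    recon = decode(z, W_dec, np.zeros((2, d_model)), mask)
    print(recon.shape)  # (2, 3, 5): one reconstruction per model
```

Under this assumed masking, scaling an RL-only latent changes only the RL model's reconstruction, whereas shared latents write to both models.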