Keywords: Reasoning models, efficient reasoning, sparse autoencoder
TL;DR: We repurpose sparse autoencoders: rather than using them to interpret models, we use them to efficiently instill strong reasoning abilities in models.
Abstract: How cost-effectively can we elicit strong reasoning abilities in language models by leveraging their underlying representations? We present Resa, a family of reasoning models trained via an efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data *without any reasoning traces*. When applied to certain Qwen-style models before further RL training, SAE-Tuning retains 97\% of the RL-trained counterpart's performance while reducing training costs by 2000x to roughly \$1 and training time by 450x to around 20 minutes. Furthermore, even at the 1.5B model size, SAE-Tuning on lightly RL-trained models delivers strong reasoning results, reaching 43.33\% Pass@1 on AIME24 and 90\% Pass@1 on AMC23. We also show that SAE-Tuning works for Llama-style models, boosting their scores by over 10\% on tasks like AMC23 and MATH500. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger, overlapping corpus. Modularity means abilities extracted from models like Qwen or Qwen-Math can be attached to the R1-Distilled Qwen model at test time, *without any retraining*, and yield comparable gains. Extensive ablations validate these findings, and all artifacts are fully open-sourced.
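For readers unfamiliar with sparse autoencoders, the core component is simple: an SAE maps a model's hidden activations into an overcomplete, sparse feature basis and reconstructs them. The sketch below is a toy forward pass only, with illustrative sizes and random (unlearned) weights; the actual SAE-Tuning objective, where the SAE attaches in the model, and how it guides fine-tuning are described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32  # toy sizes; real SAEs use a much larger feature dimension

# Random SAE parameters; in practice these are trained to reconstruct
# activations under a sparsity penalty on the feature codes.
W_enc = rng.normal(size=(d_model, d_feat)) / np.sqrt(d_model)
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model)) / np.sqrt(d_feat)

def sae_forward(x):
    """Encode activations into nonnegative sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps roughly half the codes at zero here
    x_hat = f @ W_dec                       # linear decoder back to activation space
    return f, x_hat

x = rng.normal(size=(4, d_model))           # batch of fake hidden states
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)                 # (4, 32) (4, 8)
```

With trained weights, the nonzero entries of `f` are the interpretable "features" that SAE-Tuning repurposes as a training signal.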
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9681