Leveraging Instruction Tuning and Merging for Reasoning Model Adaptation

Published: 26 May 2026, Last Modified: 26 May 2026
Venue: ICML 2026 FoGen Workshop Poster
License: CC BY 4.0
Keywords: reasoning, llm, model merging, finetuning, rlvr, verifiable rewards
Abstract: Large reasoning models (LRMs) have demonstrated impressive performance in domains such as mathematics and coding. These domains permit reliable verification of model outputs, which is important for enabling the reinforcement learning that drives LRM performance gains. However, training reasoning models on domains that lack reliable verifiers remains challenging. Meanwhile, for both verifiable and unverifiable domains, large amounts of instruction-tuning data with human-written solutions remain unused. In this work, we show that this instruction-tuning data can be used efficiently to further improve reasoning models. To this end, we first apply classic instruction tuning, without reasoning traces, to the LRM. Next, we merge the instruction-tuned model with the original reasoning model, recovering its reasoning behavior on the target domain. Our extensive evaluation demonstrates that this technique improves LRM performance in both verifiable and hard-to-verify domains, including coding and text summarization, while preserving LRM capabilities across other domains. Importantly, our method is highly efficient, enabling such improvements for just a few tens of dollars.
Submission Number: 130
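
The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which merge operator is used, so plain linear interpolation of parameters is assumed here purely for illustration; the function name `merge_linear`, the weight `alpha`, and the toy list-of-floats parameter format are hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def merge_linear(reasoning: StateDict, tuned: StateDict, alpha: float = 0.5) -> StateDict:
    """Return (1 - alpha) * reasoning + alpha * tuned for every parameter.

    ASSUMPTION: linear interpolation is one common merge operator; the
    paper's actual merging method may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if reasoning.keys() != tuned.keys():
        raise ValueError("both models must share the same parameter names")

    merged: StateDict = {}
    for name, base in reasoning.items():
        other = tuned[name]
        if len(base) != len(other):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # alpha = 0 keeps the original reasoning model,
        # alpha = 1 keeps the instruction-tuned model.
        merged[name] = [(1.0 - alpha) * b + alpha * t for b, t in zip(base, other)]
    return merged


if __name__ == "__main__":
    original_lrm = {"layer.weight": [0.0, 2.0]}
    instruction_tuned = {"layer.weight": [4.0, 2.0]}
    print(merge_linear(original_lrm, instruction_tuned, alpha=0.25))
```

Under this assumption, `alpha` trades off how much of the instruction-tuned behavior is kept against how much of the original reasoning behavior is recovered; in practice the same interpolation would be applied to real checkpoint tensors rather than Python lists.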