How Reasoning Evolves from Post-Training Data in Sequential Decision-Making Domains

ICLR 2026 Conference Submission 15799 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reasoning, Reasoning Models, Markov Decision Process, MDP, Language Models, Supervised Fine-Tuning, Reinforcement Learning, RLVR, Dataset Design, Chess
TL;DR: We study how reasoning evolves -- both qualitatively and quantitatively -- from custom post-training datasets through SFT and RL in a sequential decision-making domain (chess).
Abstract: We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically inspired datasets impacts language model performance in a verifiable Markov Decision Process (MDP) such as chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance; however, the RL stage elicits $\textit{unfaithful}$ reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and, as a side effect, reduces hallucination rates. Finally, we find that several SFT-checkpoint metrics -- spanning evaluation performance, hallucination rates, and reasoning quality -- are predictive of post-RL model performance. We release checkpoints, final models, training data, evaluations, and code, which together allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
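
To make the "verifiable MDP" framing concrete, below is a minimal, hypothetical sketch of what a verifiable reward for chess RLVR could look like, assuming the python-chess library, a FEN string for the position, and a model move in UCI notation. The function name `verifiable_reward` and the specific reward values are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code): a verifiable reward for chess RLVR.
# Assumes python-chess is installed and the model emits a UCI move after its reasoning.
import chess

def verifiable_reward(fen: str, model_move_uci: str, best_move_uci: str) -> float:
    """Score a model move against a reference best move for a given position.

    Returns 1.0 for matching the reference move, 0.1 for any other legal move,
    and 0.0 for an unparseable or illegal (hallucinated) move. These values are
    illustrative; a real setup might use engine evaluations for denser rewards.
    """
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(model_move_uci)
    except ValueError:
        return 0.0  # unparseable move string
    if move not in board.legal_moves:
        return 0.0  # illegal move, i.e., a hallucination
    return 1.0 if model_move_uci == best_move_uci else 0.1

# Example usage on the starting position:
# verifiable_reward(chess.STARTING_FEN, "e2e4", "e2e4") -> 1.0
```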
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15799