Abstract: Recent advancements in AI, such as OpenAI's new o models, Google's Gemini Thinking model, and DeepSeek-R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking or reasoning during inference, taking additional time and compute to produce higher-quality outputs. This work aims to discover the algorithmic framework behind training LRMs. Approaches based on self-consistency, process reward modeling, and AlphaZero highlight that reasoning is a form of guided search. Building on this principle, we ask: what is the simplest and most scalable way to implement search in the context of LLMs?
Towards answering this question, we propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, whenever possible; (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors; and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple the exploration and correctness signals during PPO training, carefully balancing them to improve performance and efficiency.
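To make the decoupling in steps (2) and (3) concrete, the sketch below shows one way an exploration bonus and a verifier signal could be computed separately and then combined into a scalar reward for PPO-style training. The function names, the step-count bonus, and the balancing coefficient alpha are illustrative assumptions, not the exact implementation used in this work.

```python
# Minimal sketch of decoupled exploration + correctness rewards (hypothetical names).

def exploration_reward(reasoning_steps: list[str], cap: int = 32) -> float:
    """Simplest exploration signal: more intermediate steps earn more reward,
    capped so the policy cannot inflate trace length indefinitely."""
    return min(len(reasoning_steps), cap) / cap

def outcome_reward(predicted_answer: str, reference_answer: str) -> float:
    """Binary verifier signal: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0

def combined_reward(reasoning_steps: list[str], predicted_answer: str,
                    reference_answer: str, alpha: float = 0.1) -> float:
    """Decoupled combination: correctness dominates, exploration is a small
    additive bonus weighted by a hypothetical coefficient alpha."""
    return (outcome_reward(predicted_answer, reference_answer)
            + alpha * exploration_reward(reasoning_steps))

# Example: a correct answer reached via 8 intermediate reasoning steps.
steps = [f"step {i}" for i in range(8)]
print(combined_reward(steps, "42", "42"))  # 1.0 + 0.1 * (8/32) = 1.025
```

Keeping the two terms separate makes it possible to tune the exploration bonus without weakening the correctness signal, which is the balance the framework aims for.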
We perform empirical studies of the RLSP framework in the math domain and show that models trained with RLSP demonstrate improved reasoning abilities. On the Llama-3.1-8B-Instruct model, RLSP boosts performance by 23% on the MATH-500 test set; on AIME 2024 math problems, Qwen2.5-32B-Instruct improves by 10% with the RLSP technique.
The more important finding of this work is that models trained using the RLSP technique, even with the simplest exploration reward that encourages the model to take more intermediate steps before arriving at a solution, exhibit several emergent behaviors such as backtracking, exploration of ideas, and verification. Furthermore, our framework enables such emergent behaviors across multiple model families, sizes, and domains. These findings suggest that the RLSP framework might be enough to enable the emergence of complex reasoning abilities in LLMs when scaled appropriately.
Lastly, we propose a theory as to why the RLSP search strategy is more suitable for LLMs than previous approaches considered in the literature, inspired by a remarkable recent result showing that CoT provably increases the computational power of LLMs, and hence their reasoning, and that these abilities grow with the number of steps in the CoT [LLZM24, MS23]. Our code is available at: https://github.com/GuanghaoYe/Emergence-of-Thinking.