Abstract: Recent advancements in AI, such as OpenAI's new o models, Google's Gemini Thinking model, and DeepSeek-R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking or reasoning during inference, taking additional time and compute to produce higher-quality outputs. This work aims to discover the algorithmic framework behind training LRMs. Approaches based on self-consistency, process reward modeling, and AlphaZero highlight that reasoning is a form of guided search. Building on this principle, we ask: what is the simplest and most scalable way to implement search in the context of LLMs?
Towards answering this question, we propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, whenever possible; (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors; and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple the exploration and correctness signals during PPO training, carefully balancing them to improve performance and efficiency.
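To make the decoupling in steps (2) and (3) concrete, the sketch below shows one way an exploration bonus and a verifier signal could be computed separately and then combined into a scalar reward for PPO-style training. The function names, the step-count bonus, and the balancing coefficient alpha are illustrative assumptions, not the exact implementation used in this work.

```python
# Minimal sketch of decoupled exploration + correctness rewards (hypothetical names).

def exploration_reward(reasoning_steps: list[str], cap: int = 32) -> float:
    """Simplest exploration signal: more intermediate steps earn more reward,
    capped so the policy cannot inflate trace length indefinitely."""
    return min(len(reasoning_steps), cap) / cap

def outcome_reward(predicted_answer: str, reference_answer: str) -> float:
    """Binary verifier signal: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0

def combined_reward(reasoning_steps: list[str], predicted_answer: str,
                    reference_answer: str, alpha: float = 0.1) -> float:
    """Decoupled combination: correctness dominates, exploration is a small
    additive bonus weighted by a hypothetical coefficient alpha."""
    return (outcome_reward(predicted_answer, reference_answer)
            + alpha * exploration_reward(reasoning_steps))

# Example: a correct answer reached via 8 intermediate reasoning steps.
steps = [f"step {i}" for i in range(8)]
print(combined_reward(steps, "42", "42"))  # 1.0 + 0.1 * (8/32) = 1.025
```

Keeping the two terms separate makes it possible to tune the exploration bonus without weakening the correctness signal, which is the balance the framework aims for.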
We perform empirical studies of the RLSP framework in the math domain and show that models trained with RLSP demonstrate improved reasoning abilities. On the Llama-3.1-8B-Instruct model, RLSP boosts performance by 23% on the MATH-500 test set; on AIME 2024 math problems, Qwen2.5-32B-Instruct improves by 10% with the RLSP technique.
The more important finding of this work is that models trained using the RLSP technique, even with the simplest exploration reward that encourages the model to take more intermediate steps before arriving at a solution, exhibit several emergent behaviors such as backtracking, exploration of ideas, and verification. Furthermore, our framework enables such emergent behaviors across multiple model families, sizes, and domains. These findings suggest that the RLSP framework might be enough to enable the emergence of complex reasoning abilities in LLMs when scaled appropriately.
Lastly, we propose a theory as to why the RLSP search strategy is more suitable for LLMs than previous approaches considered in the literature, inspired by a remarkable recent result showing that CoT provably increases the computational power of LLMs, and hence their reasoning, and that these abilities grow with the number of steps in the CoT [LLZM24, MS23]. Our code is available at: https://github.com/GuanghaoYe/Emergence-of-Thinking.