Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning

ICLR 2026 Conference Submission 15942 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Tree-structured Markov Chain; Process Reward Model; RLVR; Squeezing Effect
Abstract: Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RL with verifiable rewards (RLVR) and inference scaling with outcome or process reward models (ORM/PRM). While recent work highlights the role of *exploration* and *entropy stability* in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing tree-like reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps at all if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as high- versus low-probability Markov transitions, and formalize post-training dynamics through Multi-task Tree-structured Markov Chains (TMC). In this tractable model, pretraining corresponds to tree expansion, while post-training corresponds to CoT reweighting. We prove that several phenomena recently observed in empirical studies arise naturally in this setting: **(1)** RLVR induces a *squeezing effect*, reducing CoT entropy and forgetting some correct paths; **(2)** the population rewards of ORM/PRM encourage consistency rather than accuracy, thereby favoring common patterns; and **(3)** certain rare, high-uncertainty CoTs from the base model are responsible for solving hard problem instances. Together, these results explain why exploration remains essential even when confined to the base model’s tree scope: it preserves access to rare but crucial CoTs needed for difficult cases, which are otherwise squeezed out by RLVR or disfavored by inference scaling. Building on this, we further prove that exploration strategies such as rejecting easy instances and KL regularization help preserve rare CoTs. Empirical simulations corroborate our theoretical results.
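
As a minimal sketch of the squeezing effect described above (not the paper's construction; the six-path base distribution, reward vector, rollout budget `ROLLOUTS`, learning rate `LR`, and step count are all illustrative assumptions), the toy simulation below flattens a tiny tree-structured Markov chain into a categorical distribution over complete CoT paths and runs an RLVR-style REINFORCE update with a group-mean baseline and a small rollout budget. The entropy of the path distribution typically collapses onto the common correct paths, and the rare-but-correct path tends to lose mass relative to the base model, mirroring claims **(1)** and **(3)**.

```python
# Toy illustration (assumed setup, not the paper's): RLVR-style reweighting of a
# base model's CoT paths squeezes entropy and erodes a rare-but-correct path.
import numpy as np

rng = np.random.default_rng(0)

# Base model over 6 complete CoT paths: paths 0-1 are common and correct,
# path 5 is rare but correct (the only one that would solve a "hard" instance),
# paths 2-4 are incorrect.
base_logits = np.log(np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03]))
correct = np.array([1, 1, 0, 0, 0, 1], dtype=float)  # verifiable reward per path

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

logits = base_logits.copy()
ROLLOUTS, LR, STEPS = 8, 0.5, 300  # small group of rollouts per update step

for _ in range(STEPS):
    p = softmax(logits)
    samples = rng.choice(len(p), size=ROLLOUTS, p=p)  # sample a group of CoTs
    rewards = correct[samples]
    adv = rewards - rewards.mean()                    # group-mean baseline
    grad = np.zeros_like(logits)
    for s, a in zip(samples, adv):                    # REINFORCE: a * grad log p(s)
        onehot = np.zeros_like(logits)
        onehot[s] = 1.0
        grad += a * (onehot - p)
    logits += LR * grad / ROLLOUTS

p_final = softmax(logits)
print("base entropy :", round(entropy(softmax(base_logits)), 3))
print("final entropy:", round(entropy(p_final), 3))
print("rare correct path mass: base 0.03 -> final", round(p_final[5], 4))
```

The mechanism behind the sketch: under the softmax parametrization, the expected update for a correct path scales with its current probability, so common correct paths are reinforced faster than rare ones, and with a finite rollout budget the rare correct CoT is often never sampled at all. It is therefore squeezed out even though it is never explicitly penalized, which is exactly why exploration that keeps such paths alive matters.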
Primary Area: learning theory
Submission Number: 15942