Outcome-based Exploration for LLM Reasoning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025, ARLET, CC BY 4.0
Track: Research Track
Keywords: exploration, LLM reasoning
Abstract: Reinforcement learning (RL) has become a powerful tool for improving the reasoning ability of large language models (LLMs). While outcome-based RL, which rewards models solely on the correctness of the final answer, achieves strong accuracy gains, it also causes a systematic loss of diversity in generations. This collapse undermines real-world performance, where diversity is essential for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and uncover two key properties: (i) transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses based only on final outcomes. We introduce two complementary algorithms: historical exploration, which rewards rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes repetition within a batch to promote test-time diversity. Experiments across multiple models and datasets show that both methods improve accuracy while mitigating diversity collapse. Together, they offer a practical path toward RL methods that enhance LLM reasoning without sacrificing the diversity critical for scalable deployment.
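The abstract describes two bonus schemes: a UCB-style historical bonus for rarely observed final answers and a within-batch penalty for repeated answers. Below is a minimal illustrative sketch of how such outcome-based bonuses might be combined with a correctness reward; it is not the authors' implementation, and names such as `answer_counts`, `c_hist`, and `c_batch` are assumptions introduced here for clarity.

```python
# Hypothetical sketch of outcome-based exploration bonuses (not the paper's released code).
# Each sample's reward = correctness + historical UCB-style bonus - batch repetition penalty.
import math
from collections import Counter

def outcome_exploration_rewards(
    answers,        # final answers extracted from a batch of sampled generations
    correct,        # 0/1 correctness labels for those answers
    answer_counts,  # historical per-problem Counter of answers seen so far (assumed state)
    c_hist=0.5,     # assumed weight of the historical exploration bonus
    c_batch=0.5,    # assumed weight of the within-batch repetition penalty
):
    """Return per-sample rewards augmented with outcome-based exploration terms."""
    total = sum(answer_counts.values()) + len(answers)
    batch_counts = Counter(answers)
    rewards = []
    for ans, r in zip(answers, correct):
        n = answer_counts[ans] + 1  # historical count including this observation
        # Historical exploration: rarely observed answers receive a larger UCB-style bonus.
        hist_bonus = c_hist * math.sqrt(math.log(total) / n)
        # Batch exploration: answers repeated within the current batch are penalized.
        batch_penalty = c_batch * (batch_counts[ans] - 1) / len(answers)
        rewards.append(float(r) + hist_bonus - batch_penalty)
    # Update historical counts after computing the bonuses for this batch.
    answer_counts.update(answers)
    return rewards

# Toy usage for a single problem: two repeated correct answers, two distinct incorrect ones.
counts = Counter()
print(outcome_exploration_rewards(["42", "42", "17", "19"], [1, 1, 0, 0], counts))
```

In this sketch the historical term plays the role of the paper's historical exploration and the repetition penalty plays the role of its batch exploration; the specific bonus forms and weights here are placeholders, not the paper's exact formulation.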
Submission Number: 118