DeRL: Diverse‑Exploration Reinforcement Learning for Large Language Models Improves Mathematical Reasoning
Keywords: LLM, Reinforcement Learning, Reasoning
TL;DR: We show that rewarding diverse solution generation during reinforcement learning enhances LLM performance on reasoning tasks.
Abstract: Current reinforcement-learning (RL) pipelines for large language models (LLMs) that tackle mathematical reasoning and formal theorem proving tend to over-exploit a few high-probability chain-of-thought (CoT) sequences. Because rewards are granted solely for producing correct answers, the policy quickly converges on those paths, neglecting the rich space of alternative proofs and solution strategies that most math problems admit. We address this limitation with Diverse-Exploration RL (DeRL), a simple yet effective modification to the reward function and the RL prompts. During training, the model is explicitly instructed to solve each problem without relying on its previously generated CoT. An auxiliary LLM judge then scores how dissimilar the approach in the new output is from the previous CoT. Combined with the correctness reward, this diversity signal encourages exploration of novel reasoning paths while preserving accuracy. We evaluate DeRL on both natural-language math questions with boxed answers and formal theorem-proving problems in Lean. Across the MATH benchmark and the Leanabell dataset, DeRL achieves a relative Pass@1 gain of more than 10% over the PPO baseline and consistently improves Pass@N as well. Our findings demonstrate that incorporating diversity-aware rewards facilitates broader exploration and enhances the reasoning capabilities of LLMs, indicating a promising direction for improving current reinforcement learning pipelines.
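The abstract does not spell out how correctness and diversity are combined into a single reward; the sketch below is one plausible reading, assuming the LLM judge returns a dissimilarity score in [0, 1] and the two terms are mixed with a fixed weight. The function and parameter names (`derl_reward`, `diversity_weight`, the `judge` callable) are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of a DeRL-style reward, not the authors' code.
# Combines answer correctness with an LLM-judged dissimilarity between
# the newly generated chain-of-thought and a previously generated one.

def derl_reward(is_correct: bool,
                new_cot: str,
                previous_cot: str,
                judge,                       # assumed: callable LLM judge returning a float in [0, 1]
                diversity_weight: float = 0.5) -> float:
    """Return a scalar reward for one rollout.

    `judge(new_cot, previous_cot)` is assumed to score how different
    the two solution approaches are (1.0 = completely different).
    """
    correctness = 1.0 if is_correct else 0.0
    dissimilarity = judge(new_cot, previous_cot)
    return correctness + diversity_weight * dissimilarity
```

In this reading, the correctness term keeps the policy anchored to valid solutions while the weighted dissimilarity term rewards exploring new proof strategies; the weight would be a tunable hyperparameter of the RL pipeline.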
Primary Area: reinforcement learning
Submission Number: 6421