Keywords: Large Language Models, Policy Optimization
TL;DR: We propose EXPO, a reinforcement learning method that amplifies the learning signal from rare but valuable responses in LLM training to promote exploration and improve performance.
Abstract: Reinforcement learning has become the standard approach for aligning large language models on complex reasoning tasks. However, these methods often overlook rare but valuable responses, as learning signals are dominated by high-probability, frequently sampled outputs. To address this, we propose EXploration-Enhanced Policy Optimization (EXPO), a novel approach that dynamically reweights the advantage of each response based on its generation probability. EXPO amplifies gradients from rare, valuable samples, ensuring they contribute meaningfully to policy updates and guide the model toward underexplored, high-value solutions. We evaluate EXPO on multiple mathematical reasoning benchmarks, where it consistently outperforms strong baselines across model scales: on Qwen2.5-Math-1.5B, EXPO surpasses DAPO by +3.0\%; on Llama-3.2-3B-Instruct, by +3.6\%; and on the larger Qwen2.5-Math-7B, it outperforms DAPO by +4.6\%, Dr.GRPO by +5.3\%, and the instruction-tuned baseline by +9.1\%. These gains demonstrate EXPO’s effectiveness in leveraging valuable but underrepresented responses for better policy learning.
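One possible form of the probability-based reweighting described above, written as a minimal sketch under assumed notation (the exponent $\alpha$ is a hypothetical hyperparameter; the paper's exact formulation may differ): given a prompt $x$, a sampled response $y_i$ with standard advantage $A_i$, and current policy $\pi_\theta$, the reweighted advantage could take the form
$$\tilde{A}_i = w_i \, A_i, \qquad w_i \propto \big(\pi_\theta(y_i \mid x)\big)^{-\alpha}, \quad \alpha > 0,$$
so that responses with lower generation probability $\pi_\theta(y_i \mid x)$ receive larger weights and therefore contribute larger gradients to the policy update, consistent with amplifying rare but valuable samples.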
Primary Area: reinforcement learning
Submission Number: 10956