EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

ICLR 2026 Conference Submission 3952 Authors

11 Sept 2025 (modified: 16 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language model, Reinforcement learning, Reasoning, Exploration, Entropy
TL;DR: EFRame is an Exploration-Filter-Replay framework that enhances GRPO by improving exploration, stability, and efficiency for deeper reasoning in LLMs.
Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, which undermine its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO along three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks and evaluation settings demonstrate that EFRame achieves consistent gains, including a 37.9% improvement on Geometry3K over GRPO, and exceeds RL baselines under Pass@K settings. EFRame further supports fine-grained sample categorization and precise entropy control, highlighting it as a robust solution for advancing deeper reasoning in LLMs.
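To make the abstract's training cycle concrete, below is a minimal, hypothetical Python sketch of an Exploration-Filter-Replay loop around a GRPO-style update. All names (rollout, filter_samples, ReplayBuffer, grpo_step) and the specific filtering and replay criteria are illustrative assumptions, not EFRame's actual implementation.

```python
# Hypothetical sketch of an Exploration-Filter-Replay training cycle.
# The filtering rule (reward spread) and replay rule (reward threshold)
# are assumptions for illustration, not the paper's exact criteria.
import random
from collections import deque


def rollout(prompt, n_samples=8):
    """Exploration: stand-in for sampling multiple responses per prompt
    and scoring each with a reward model or verifier."""
    return [{"prompt": prompt, "response": f"cand-{i}", "reward": random.random()}
            for i in range(n_samples)]


def filter_samples(group, min_spread=0.05):
    """Online filtering: drop groups whose rewards are nearly identical,
    since their group-relative advantages carry little gradient signal."""
    rewards = [s["reward"] for s in group]
    if max(rewards) - min(rewards) < min_spread:
        return []
    return group


class ReplayBuffer:
    """Experience replay: keep rare high-reward trajectories for reuse."""

    def __init__(self, capacity=256, threshold=0.9):
        self.buf = deque(maxlen=capacity)
        self.threshold = threshold

    def add(self, group):
        self.buf.extend(s for s in group if s["reward"] >= self.threshold)

    def sample(self, k=4):
        return random.sample(self.buf, min(k, len(self.buf)))


def grpo_step(batch):
    """Placeholder for a group-relative policy update on the kept samples."""
    if batch:
        mean_r = sum(s["reward"] for s in batch) / len(batch)
        print(f"update on {len(batch)} samples, mean reward {mean_r:.2f}")


replay = ReplayBuffer()
for prompt in ["p0", "p1", "p2"]:
    group = rollout(prompt)          # Exploration: extra rollouts per prompt
    kept = filter_samples(group)     # Filter: discard low-signal groups
    replay.add(kept)                 # Replay: store rare informative samples
    grpo_step(kept + replay.sample())
```

The sketch only illustrates how the three stages compose into one loop; the paper's actual rollout budget, filtering rule, and replay schedule are specified in the submission itself.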
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 3952