Keywords: Large language model, reasoning model, chain-of-thought, reinforcement learning
TL;DR: A novel RL framework for training reasoning models with dynamic supervision incorporation.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs).
However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between problem difficulty and model capability: overly difficult problems hinder reasoning path discovery, while overly simple problems offer little learning signal.
To address this, we first formalize the effect of problem difficulty by quantifying the relationship between loss descent magnitude and rollout accuracy.
Building on this analysis, we propose SEELE, a supervision-aided RLVR framework that dynamically adjusts problem difficulty to lie within the high-performance region.
SEELE augments each training sample by appending a hint (part of a full solution) for difficulty reduction.
Unlike previous hint-based approaches, SEELE computes a tailored hint length for each individual problem to achieve an optimal difficulty.
The optimal hint length is determined via multi-round rollout sampling, where an item response theory model fits accuracy–hint pairs from previous rounds to predict the next-round hint.
This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency.
Experiments show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +10.0 and +8.4 points, respectively, and exceeds the best prior supervision-aided approach by +3.8 points on average across six math reasoning benchmarks.
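The abstract describes fitting an item response theory (IRT) model to accuracy–hint pairs from previous rollout rounds and inverting it to predict the next hint length. A minimal sketch of that idea follows, assuming a 2-parameter logistic IRT curve, a squared-error fit via gradient descent, and a target rollout accuracy marking the high-performance region; the function names and the choice of fitting procedure are illustrative, not the paper's implementation.

```python
import math

def irt_accuracy(h, a, b):
    # 2-parameter logistic IRT curve: predicted rollout accuracy
    # as a function of hint ratio h (fraction of the solution revealed).
    return 1.0 / (1.0 + math.exp(-a * (h - b)))

def fit_irt(pairs, lr=0.5, steps=2000):
    # Fit (a, b) to observed (hint_ratio, accuracy) pairs collected
    # in previous rollout rounds, by gradient descent on squared error.
    a, b = 5.0, 0.5
    for _ in range(steps):
        ga = gb = 0.0
        for h, acc in pairs:
            p = irt_accuracy(h, a, b)
            err = p - acc
            dp = p * (1.0 - p)          # derivative of the sigmoid
            ga += 2.0 * err * dp * (h - b)
            gb += 2.0 * err * dp * (-a)
        a -= lr * ga / len(pairs)
        b -= lr * gb / len(pairs)
    return a, b

def next_hint_ratio(pairs, target_acc=0.5):
    # Invert the fitted curve to find the hint length predicted to put
    # rollout accuracy at the target (illustrating instance-level,
    # real-time difficulty adjustment), clipped to [0, 1].
    a, b = fit_irt(pairs)
    h = b - math.log(1.0 / target_acc - 1.0) / a
    return min(1.0, max(0.0, h))
```

For example, given observed rounds `[(0.0, 0.1), (0.3, 0.3), (0.6, 0.6), (0.9, 0.9)]`, a higher target accuracy yields a longer predicted hint, since more supervision makes the problem easier.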
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5951