Keywords: Large language model, reasoning model, chain-of-thought, reinforcement learning
TL;DR: A novel RL framework for training reasoning models with dynamic supervision incorporation.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs).
However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between problem difficulty and model capability: overly difficult problems hinder reasoning path discovery, while overly simple problems offer little learning signal.
To address this, we first formalize the effect of problem difficulty by quantifying the relationship between loss descent magnitude and rollout accuracy.
Building on this analysis, we propose SEELE, a supervision-aided RLVR framework that dynamically adjusts problem difficulty to lie within the high-performance region.
SEELE augments each training sample by appending a hint (part of a full solution) for difficulty reduction.
Unlike previous hint-based approaches, SEELE computes a tailored hint length for each individual problem to achieve an optimal difficulty.
The optimal hint length is determined via multi-round rollout sampling, where an item response theory model fits accuracy–hint pairs from previous rounds to predict the next-round hint.
This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency.
Experiments show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +10.0 and +8.4 points, respectively, and exceeds the best prior supervision-aided approach by +3.8 points on average across six math reasoning benchmarks.
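The abstract describes fitting an item response theory (IRT) model to accuracy–hint pairs from previous rollout rounds and inverting it to predict the next hint length. A minimal sketch of that idea follows, assuming a 2-parameter logistic IRT curve, a squared-error fit via gradient descent, and a target rollout accuracy marking the high-performance region; the function names and the choice of fitting procedure are illustrative, not the paper's implementation.

```python
import math

def irt_accuracy(h, a, b):
    # 2-parameter logistic IRT curve: predicted rollout accuracy
    # as a function of hint ratio h (fraction of the solution revealed).
    return 1.0 / (1.0 + math.exp(-a * (h - b)))

def fit_irt(pairs, lr=0.5, steps=2000):
    # Fit (a, b) to observed (hint_ratio, accuracy) pairs collected
    # in previous rollout rounds, by gradient descent on squared error.
    a, b = 5.0, 0.5
    for _ in range(steps):
        ga = gb = 0.0
        for h, acc in pairs:
            p = irt_accuracy(h, a, b)
            err = p - acc
            dp = p * (1.0 - p)          # derivative of the sigmoid
            ga += 2.0 * err * dp * (h - b)
            gb += 2.0 * err * dp * (-a)
        a -= lr * ga / len(pairs)
        b -= lr * gb / len(pairs)
    return a, b

def next_hint_ratio(pairs, target_acc=0.5):
    # Invert the fitted curve to find the hint length predicted to put
    # rollout accuracy at the target (illustrating instance-level,
    # real-time difficulty adjustment), clipped to [0, 1].
    a, b = fit_irt(pairs)
    h = b - math.log(1.0 / target_acc - 1.0) / a
    return min(1.0, max(0.0, h))
```

For example, given observed rounds `[(0.0, 0.1), (0.3, 0.3), (0.6, 0.6), (0.9, 0.9)]`, a higher target accuracy yields a longer predicted hint, since more supervision makes the problem easier.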
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5951