CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention

ACL ARR 2026 January Submission 3765 Authors

04 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Mathematical Reasoning, Reinforcement Learning with Verified Rewards
Abstract: Recent advances in Reinforcement Learning with Verified Reward (RLVR) have significantly bolstered the reasoning capabilities of Large Language Models (LLMs). However, conventional RLVR pipelines often rely on static initial-state sampling, leading to overly deterministic behavior, rapid entropy collapse, and plateaued performance during extended training. To mitigate this, we propose CURE (Critical-token-gUided Re-concatenation for Entropy-collapse prevention), a two-stage framework balancing exploration and exploitation. In Stage 1, CURE encourages exploration by re-generating branched trajectories at high-entropy critical tokens, jointly optimizing them with original paths to maintain diversity. Compared to vanilla DAPO, this stage yields superior reasoning performance while preserving high entropy. In Stage 2, we transition to static sampling using DAPO, placing the model in familiar states to consolidate exploitation. Extensive experiments on Qwen-2.5-Math-7B demonstrate that CURE outperforms existing RLVR methods by $5$\% across six math benchmarks, achieving state-of-the-art results in both reasoning accuracy and entropy maintenance.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, reasoning, logical reasoning
Languages Studied: English
Submission Number: 3765