Don't Tell the Answer, Truly Guide the Reasoning During RL Rollouts

ACL ARR 2026 January Submission 9103 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Group Relative Policy Optimization
Abstract: Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods such as Group Relative Policy Optimization (GRPO) often fail when task difficulty exceeds model capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such approaches often induce severe distributional mismatches that destabilize policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity, and introduce Affinity, the first quantitative metric for monitoring the compatibility between external guidance and the model's intrinsic policy. To address this, we propose HINT, an adaptive framework designed to enhance reasoning capabilities while explicitly preserving high Affinity. First, instead of revealing partial answers, HINT supplies Meta-Hints, which act as abstract cognitive scaffolding that guides the model to articulate solutions independently. Second, to ensure stability, we integrate Affinity-Aware Policy Optimization (AAPO), which dynamically modulates the learning objective based on the Affinity. Extensive experiments across diverse benchmarks demonstrate that HINT achieves state-of-the-art performance, exhibiting superior stability and robust generalization to out-of-distribution tasks. Code is available on GitHub.
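The abstract does not define Affinity or the AAPO objective precisely, but the idea of modulating a GRPO-style update by a compatibility score can be sketched. Below is a minimal, hypothetical illustration: `affinity` is assumed here to be the geometric-mean probability the current policy assigns to the tokens of a guided rollout (a stand-in for the paper's actual metric), and `aapo_weighted_advantages` down-weights group-normalized GRPO advantages when Affinity is low. All function names and formulas are illustrative assumptions, not the authors' method.

```python
import numpy as np

def affinity(policy_logprobs):
    """Hypothetical Affinity proxy: geometric-mean per-token probability
    the current policy assigns to an externally guided rollout.
    (The paper's exact definition is not given in the abstract.)"""
    return float(np.exp(np.mean(policy_logprobs)))

def aapo_weighted_advantages(rewards, policy_logprobs):
    """Illustrative AAPO-style weighting: GRPO group-normalized
    advantages, scaled by Affinity so that low-compatibility guidance
    produces smaller, more conservative policy updates."""
    rewards = np.asarray(rewards, dtype=float)
    # GRPO: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Scale the whole group's advantages by the Affinity score.
    return affinity(policy_logprobs) * adv

# Toy usage: four rollouts with binary rewards, and per-token
# probabilities the policy assigned to a guided trajectory.
rewards = [1.0, 0.0, 1.0, 0.0]
logps = np.log([0.6, 0.5, 0.7])
weighted = aapo_weighted_advantages(rewards, logps)
```

Because the advantages are mean-centered before scaling, the affinity factor shrinks the magnitude of every update uniformly rather than biasing its direction.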
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Question Answering
Languages Studied: English
Submission Number: 9103