Keywords: Large Language Models, Group Relative Policy Optimization
Abstract: Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs).
However, prevalent methods like GRPO often fail when task difficulty exceeds model capacity, leading to reward sparsity and inefficient training.
While prior work attempts to mitigate this using off-policy data, such approaches often induce severe distributional mismatches that destabilize policy updates.
In this work, we identify a core issue underlying these failures, which we term low training affinity, and introduce Affinity, the first quantitative metric for monitoring the compatibility between external guidance and the model's intrinsic policy.
To address this, we propose HINT, an adaptive framework designed to enhance reasoning capabilities while explicitly preserving high Affinity.
First, instead of revealing partial answers, HINT supplies Meta-Hints, which act as abstract cognitive scaffolding to guide the model in articulating solutions independently.
Second, to ensure stability, we integrate Affinity-Aware Policy Optimization (AAPO), which dynamically modulates the learning objective based on the Affinity.
Extensive experiments across diverse benchmarks demonstrate that HINT achieves state-of-the-art performance, exhibiting superior stability and robust generalization to out-of-distribution tasks.
Code is available on GitHub.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Question Answering
Languages Studied: English
Submission Number: 9103