Keywords: Reinforcement Learning, Large Language Models, Mathematical Reasoning
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) is a promising method for enhancing the complex
problem-solving abilities of large language models (LLMs). This is particularly evident in domains requiring
long-horizon reasoning and precise execution, such as solving complex mathematical problems where solutions
hinge on a fragile sequence of tool-based actions. However, current approaches are often crippled by two
interconnected issues: the near-miss problem, where sparse rewards nullify the learning signal for
almost-correct attempts, and the resulting exploration stagnation, which prevents the model from
discovering better solutions. To address these challenges, we introduce HiPO (Hint-guided Policy Optimization),
a novel RLVR framework that enables the agent to learn from its own rare successes.
Our core insight is to capture an occasional successful trajectory within a training batch and
repurpose its initial correct steps as an on-policy “hint”. This process
transforms a single, stochastically found success into a dense contrastive learning signal,
effectively allowing the model to teach itself how to overcome the near-miss
problem and break exploration stagnation. On a challenging suite of five mathematical reasoning benchmarks,
HiPO improves avg@32 by an average of +5.0 percentage points (pp) over the strong GRPO baseline.
This improvement is driven by substantial absolute gains on challenging datasets,
including +10.3 pp on CMIMC 2025, +4.9 pp on BRUMO 2025, +4.6 pp on AIME 2024, and +3.1 pp on AIME 2025.
Furthermore, HiPO demonstrates a new exploration paradigm: by repurposing rare successes into reusable guidance,
it significantly accelerates skill acquisition on complex tasks,
establishing a more efficient and scalable path for models to autonomously master intricate reasoning.
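To make the hint mechanism concrete, the sketch below shows one plausible way to turn a batch of verified rollouts into hinted prompts: a successful trajectory is selected, its initial correct steps are kept as a prefix, and that prefix is appended to the original prompt before resampling. The function name, the hint fraction, the number of hinted samples, and the fallback behavior are our own illustrative assumptions, not the paper's implementation; HiPO's actual hint construction and contrastive objective are specified in the full paper.

```python
import random

def build_hinted_prompts(prompt, rollouts, rewards, hint_frac=0.5, n_hinted=4):
    """Hypothetical sketch of hint construction (names and constants are assumptions).

    prompt:   the original problem statement (string)
    rollouts: sampled responses, each a list of reasoning/tool-action steps (strings)
    rewards:  verifiable 0/1 outcome rewards, one per rollout
    Returns prompts extended with the prefix of a rare success, so later on-policy
    sampling is conditioned on a correct partial solution.
    """
    # Keep only the rollouts that the verifier marked as correct.
    successes = [r for r, rew in zip(rollouts, rewards) if rew == 1.0]
    if not successes:
        # No success in this batch: fall back to plain (GRPO-style) sampling.
        return []

    # Pick one successful trajectory and keep its initial correct steps as the hint.
    success = random.choice(successes)
    cut = max(1, int(len(success) * hint_frac))
    hint = "".join(success[:cut])

    # Condition new on-policy samples on the hinted context.
    return [prompt + hint for _ in range(n_hinted)]
```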
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23076