HiPO: Self-Hint Policy Optimization for RLVR

ICLR 2026 Conference Submission 23076 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026, ICLR 2026, CC BY 4.0
Keywords: Reinforcement Learning, Large Language Models, Mathematical Reasoning
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) is a promising method for enhancing the complex problem-solving abilities of large language models (LLMs). This is particularly evident in domains requiring long-horizon reasoning and precise execution, such as solving complex mathematical problems whose solutions hinge on a fragile sequence of tool-based actions. However, current approaches are often crippled by two interconnected issues: the near-miss problem, where sparse rewards nullify the learning signal for almost-correct attempts, and the resulting exploration stagnation, which prevents the model from discovering better solutions. To address these challenges, we introduce HiPO (Hint-guided Policy Optimization), a novel RLVR framework that enables the agent to learn from its own rare successes. Our core insight is to capture an occasional successful trajectory within a training batch and repurpose its initial correct steps as an on-policy “hint”. This process transforms a single, stochastically found success into a dense contrastive learning signal, effectively allowing the model to teach itself how to overcome the near-miss problem and break exploration stagnation. On a challenging suite of five mathematical reasoning benchmarks, HiPO improves avg@32 by an average of +5.0 percentage points (pp) over the strong GRPO baseline. This improvement is driven by substantial gains on challenging datasets, including +10.3 pp on CMIMC 2025, +4.9 pp on BRUMO 2025, +4.6 pp on AIME 2024, and +3.1 pp on AIME 2025. Furthermore, HiPO demonstrates a new exploration paradigm, repurposing rare successes into reusable guidance to significantly accelerate skill acquisition on complex tasks, establishing a more efficient and scalable path for models to autonomously master intricate reasoning.
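To make the hint mechanism concrete, below is a minimal Python sketch of the self-hint rollout loop described in the abstract. It is an illustration under assumptions, not the authors' implementation: the function hinted_rollouts, its arguments generate_fn and verify_fn, and the 25% hint_fraction are hypothetical placeholders for the paper's actual rollout procedure, verifier, and hint-length choice.

def hinted_rollouts(generate_fn, verify_fn, problem, num_rollouts=32, hint_fraction=0.25):
    """Sample a GRPO-style group of rollouts; if any is verified correct,
    reuse the initial steps of that success as an on-policy hint prefix.

    Hypothetical interfaces (assumptions, not the paper's API):
      generate_fn(problem, prefix="") -> completion string
      verify_fn(problem, completion)  -> 1 if the final answer is correct, else 0
    """
    # 1) Plain on-policy sampling, as in a standard GRPO group.
    rollouts = [generate_fn(problem) for _ in range(num_rollouts)]
    rewards = [verify_fn(problem, r) for r in rollouts]

    if any(rewards):
        # 2) Take the first successful trajectory and keep only its opening steps.
        success = rollouts[rewards.index(1)]
        hint = success[: int(len(success) * hint_fraction)]

        # 3) Re-sample conditioned on the hint; near-miss attempts now receive a
        #    denser, contrastive signal relative to the unhinted group.
        hinted = [generate_fn(problem, prefix=hint) for _ in range(num_rollouts)]
        rewards += [verify_fn(problem, r) for r in hinted]
        rollouts += hinted

    return rollouts, rewards

Because the hint is a prefix of the model's own successful rollout rather than an external teacher's solution, the guidance stays on-policy, and the contrast between hinted and unhinted groups remains within the model's own distribution.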
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23076