Keywords: Linear Algebra, Mathematical Deductive Reasoning, Reinforcement Learning, Verifiable Reward, Honesty Alignment, Curriculum Learning, Language Models
TL;DR: We propose a stabilization method for reinforcement learning with verifiable rewards that injects ground-truth trajectories into rollouts on mathematically structured reasoning datasets, improving honesty alignment in language models.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives, but training stability remains a key challenge. Standard approaches optimize only for final task outcomes, which can cause gradient instability when negative rewards dominate early learning. We study this issue in the context of deductive reasoning, a domain that isolates reasoning dynamics from external factual knowledge. To analyze this behavior systematically, we construct two graph-based reasoning datasets, one rooted in linear algebra and one in logical inference, each containing both solvable and unsolvable cases. We find that conventional optimization strategies, such as GRPO and curriculum learning, are sensitive to reward imbalance and task difficulty. To address these limitations, we propose ANCHOR, a reinforcement learning method that incorporates verifiable reference trajectories into rollouts to maintain stable optimization. This injection introduces a bounded positive reference signal that prevents gradient collapse. Experiments across multiple reasoning models show that ANCHOR improves convergence stability and consistency in multi-step reasoning. These results suggest a mathematically grounded approach to stabilizing reinforcement optimization for structured reasoning tasks.
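The submission does not include an implementation, so the following is a minimal sketch of the anchoring idea as we read it from the abstract, assuming GRPO-style group-normalized advantages and a verifiable reward in [0, 1]. The function name `anchored_group_advantages` and the `reference_reward` parameter are illustrative, not from the paper.

```python
import numpy as np

def anchored_group_advantages(rollout_rewards, reference_reward=1.0):
    """Sketch of the rollout-anchoring idea: append the verified reward of a
    ground-truth reference trajectory to the sampled group before GRPO-style
    normalization, so the group always contains a bounded positive signal.
    Names and the binary-reward assumption are illustrative."""
    rewards = np.append(np.asarray(rollout_rewards, dtype=float), reference_reward)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # advantages[:-1] weight the model's sampled rollouts; advantages[-1]
    # weights the injected reference trajectory during the policy update.
    return advantages[:-1], advantages[-1]

# Example: a group in which every sampled rollout failed. Without the anchor,
# all rewards are identical and the normalized advantages vanish (gradient
# collapse); with it, the group retains a usable learning signal.
sampled, anchor = anchored_group_advantages([0.0, 0.0, 0.0, 0.0])
print(sampled, anchor)  # sampled rollouts get -0.5 each; the anchor gets 2.0
```

Under these assumptions, the injected reference is what supplies the "bounded positive reference signal" the abstract describes: its advantage is capped by the group normalization, and it guarantees non-degenerate gradients even in all-failure groups.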
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11