Toward Honest Language Models for Deductive Reasoning

Published: 15 Nov 2025, Last Modified: 08 Mar 2026
Venue: AAAI 2026 Bridge (LMReasoning)
License: CC BY 4.0
Keywords: Linear Algebra, Math Deductive Reasoning, Symbolic Reasoning, Reinforcement Learning, Verifiable Reward, Honesty Alignment, Curriculum Learning, Language Models
TL;DR: We propose a reinforcement learning stabilization method with verifiable rewards that injects ground-truth reasoning trajectories, improving reasoning reliability and symbolic consistency in language models on structured deductive tasks.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for improving the reasoning reliability of language models. However, most existing approaches optimize only for final task outcomes, leaving models vulnerable to gradient collapse when negative rewards dominate early training. This limitation is particularly evident in deductive reasoning, where models must determine not only when a conclusion follows from given premises but also when it does not. To examine this systematically, we construct two graph-structured reasoning benchmarks, one based on linear algebra and one on logical inference, each containing both solvable and unsolvable instances. These tasks provide a controlled setting for studying how neural models interact with structured dependencies similar to symbolic reasoning processes. We find that existing optimization methods such as GRPO and curriculum learning remain sensitive to reward imbalance and task difficulty. To address this, we propose ANCHOR, a reinforcement learning method that incorporates verifiable reference trajectories into rollouts to maintain stable optimization. This approach introduces a consistent positive reference signal that preserves gradient variance and supports stable reasoning behavior. Experiments across multiple models show that ANCHOR improves convergence stability and reasoning reliability, suggesting a pathway toward integrating structured, symbol-like reasoning into reinforcement optimization for large language models.
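The stabilization idea in the abstract can be illustrated with a minimal sketch. In group-relative methods such as GRPO, advantages are computed by normalizing rewards within a sampled group; when every rollout in the group fails (all rewards equal), the normalized advantages are all zero and the gradient signal collapses. The sketch below shows a hypothetical ANCHOR-style variant that appends the reward of a verified reference trajectory to the group before normalization, guaranteeing nonzero variance. Function names and the exact injection scheme are illustrative assumptions, not the paper's implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: (r - mean) / std over the group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Collapsed group (e.g., all rollouts failed): no gradient signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def anchored_advantages(sampled_rewards, anchor_reward=1.0):
    """Hypothetical ANCHOR-style sketch: inject a verifiable reference
    trajectory (assumed correct, so its reward is 1.0) into the rollout
    group before normalization, preserving reward variance even when
    every sampled rollout receives a negative or zero reward."""
    return grpo_advantages(list(sampled_rewards) + [anchor_reward])

# A group where all sampled rollouts fail: plain GRPO yields zero
# advantages, while the anchored group still produces a learning signal.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))      # [0.0, 0.0, 0.0, 0.0]
print(anchored_advantages([0.0, 0.0, 0.0, 0.0]))  # [-0.5, -0.5, -0.5, -0.5, 2.0]
```

Under this sketch, the injected trajectory acts as the "consistent positive reference signal" the abstract describes: failed samples receive negative advantages relative to the anchor rather than a uniformly zero gradient.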
Submission Number: 5