$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Textual Space

ICLR 2026 Conference Submission 14424 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Reasoning, Test-Time Scaling, Textual Optimization
TL;DR: We propose $\nabla$-reasoner, an iterative decoding approach with policy refinement by test-time gradient descent on textual representations to improve LLM reasoning.
Abstract: Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-reasoner further incorporates rejection sampling and acceleration techniques to make decoding more robust and faster. Theoretically, we show that aligning an LLM with a reward function is equivalent to performing inference-time gradient descent in the sample space. Empirically, $\nabla$-reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark while reducing computation by approximately 40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
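
To make the idea of gradient descent over token logits concrete, the sketch below illustrates one plausible DTO-style update, assuming the continuation is relaxed into soft token distributions whose logits are optimized against an LM-likelihood term plus a reward term. The tiny `embed`/`lm_head`/`reward_head` modules, the loss weighting, and all hyperparameters are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of one DTO-style test-time optimization loop.
# Assumption: the continuation is a matrix of free logits (one row per position);
# a frozen "LLM" and a frozen reward model provide gradients into those logits.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, seq_len, steps, lr = 100, 32, 8, 50, 0.1

# Stand-ins for a frozen LLM (embedding table + LM head) and a reward model.
embed = torch.nn.Embedding(vocab, dim)
lm_head = torch.nn.Linear(dim, vocab)
reward_head = torch.nn.Linear(dim, 1)
for p in (*embed.parameters(), *lm_head.parameters(), *reward_head.parameters()):
    p.requires_grad_(False)

# Free logits over the continuation tokens: the textual representation
# that test-time gradient descent refines.
logits = torch.zeros(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([logits], lr=lr)

for _ in range(steps):
    probs = F.softmax(logits, dim=-1)        # soft tokens, (seq_len, vocab)
    soft_emb = probs @ embed.weight          # expected embeddings, (seq_len, dim)

    # LM-likelihood term: each position's soft token should be predictable
    # from the previous position's soft embedding.
    pred = lm_head(soft_emb[:-1])            # (seq_len - 1, vocab)
    log_lik = (probs[1:] * F.log_softmax(pred, dim=-1)).sum()

    # Reward term: scalar score of the whole continuation.
    reward = reward_head(soft_emb.mean(dim=0)).squeeze()

    loss = -(log_lik + 1.0 * reward)         # weight of 1.0 is illustrative
    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize the refined soft tokens back into a token sequence.
refined_tokens = logits.argmax(dim=-1)
print(refined_tokens.tolist())
```

In an actual decoding loop, the discretized continuation would presumably then be accepted or rejected (e.g., via the rejection-sampling step mentioned in the abstract) before generation proceeds.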
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14424