Keywords: Large Language Model, inference, Decoding algorithm
Abstract: Decoding algorithms play a central role in enhancing the performance of large language models (LLMs) on complex reasoning tasks.
A common approach incorporates Process Reward Models (PRMs), which estimate the quality of intermediate reasoning paths and guide the selection of possible continuations.
In this setting, our analysis reveals two notable phenomena: reward estimates tend to decline as reasoning progresses, and reasoning paths exhibit distinct volatility patterns across decoding steps, depending on whether they lead to correct or incorrect final answers.
In particular, correct reasoning tends to be associated with stable reward trajectories, while incorrect reasoning often shows high volatility.
Motivated by this observation, we propose Volatility-Scaled Guided Decoding (VSGD), a decoding algorithm that prioritizes candidate paths with lower volatility by jointly considering the magnitude of PRM-estimated rewards and the volatility of these rewards across decoding steps.
Experiments on datasets including GSM8K and MATH500 indicate that VSGD reduces the volatility of selected reward trajectories and improves the accuracy of the final answer.
These findings suggest that considering the temporal dynamics of reward values, in addition to their magnitude, provides a potential direction for enhancing guided decoding in LLMs.
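The abstract does not specify the exact scoring rule used by VSGD; the following is a minimal illustrative sketch that assumes candidates are ranked by the mean of their PRM rewards penalized by the standard deviation of the reward trajectory, with a hypothetical trade-off weight `lam`.

```python
# Illustrative sketch only: the precise VSGD combination of reward magnitude
# and volatility is not given in the abstract, so this assumes a simple
# mean-minus-standard-deviation score with a hypothetical weight `lam`.
from statistics import mean, stdev
from typing import Sequence


def volatility_scaled_score(rewards: Sequence[float], lam: float = 1.0) -> float:
    """Score a candidate path from its per-step PRM reward estimates."""
    if len(rewards) < 2:
        return rewards[-1]  # too few steps to estimate volatility
    return mean(rewards) - lam * stdev(rewards)


def select_candidate(candidate_trajectories: list[list[float]], lam: float = 1.0) -> int:
    """Return the index of the candidate with the highest volatility-scaled score."""
    scores = [volatility_scaled_score(r, lam) for r in candidate_trajectories]
    return max(range(len(scores)), key=scores.__getitem__)


# Example: a stable reward trajectory is preferred over one with a higher
# peak reward but large fluctuations across decoding steps.
print(select_candidate([[0.80, 0.78, 0.79], [0.95, 0.40, 0.90]]))  # -> 0
```

Under these assumed choices, the stable trajectory wins even though the volatile one contains the single highest reward, which mirrors the abstract's claim that correct reasoning tends to come with stable reward trajectories.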
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19906