Keywords: Large Language Model, inference, Decoding algorithm
Abstract: Decoding algorithms play a central role in enhancing the performance of large language models (LLMs) on complex reasoning tasks.
A common approach incorporates Process Reward Models (PRMs), which estimate the quality of intermediate reasoning paths and guide the selection of possible continuations.
In this setting, our analysis reveals two notable phenomena: reward estimates tend to decline as reasoning progresses, and reasoning paths exhibit distinct volatility patterns across decoding steps, depending on whether they lead to correct or incorrect final answers.
In particular, correct reasoning tends to be associated with stable reward trajectories, while incorrect reasoning often shows high volatility.
Motivated by this observation, we propose Volatility-Scaled Guided Decoding (VSGD), a decoding algorithm that prioritizes candidate paths with lower volatility by jointly considering the magnitude of PRM-estimated rewards and the volatility of these rewards across decoding steps.
Experiments on datasets including GSM8K and MATH500 indicate that VSGD reduces the volatility of selected reward trajectories and improves the accuracy of the final answer.
These findings suggest that considering the temporal dynamics of reward values, in addition to their magnitude, provides a potential direction for enhancing guided decoding in LLMs.
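The abstract does not specify the exact scoring rule used by VSGD; the following is a minimal illustrative sketch that assumes candidates are ranked by the mean of their PRM rewards penalized by the standard deviation of the reward trajectory, with a hypothetical trade-off weight `lam`.

```python
# Illustrative sketch only: the precise VSGD combination of reward magnitude
# and volatility is not given in the abstract, so this assumes a simple
# mean-minus-standard-deviation score with a hypothetical weight `lam`.
from statistics import mean, stdev
from typing import Sequence


def volatility_scaled_score(rewards: Sequence[float], lam: float = 1.0) -> float:
    """Score a candidate path from its per-step PRM reward estimates."""
    if len(rewards) < 2:
        return rewards[-1]  # too few steps to estimate volatility
    return mean(rewards) - lam * stdev(rewards)


def select_candidate(candidate_trajectories: list[list[float]], lam: float = 1.0) -> int:
    """Return the index of the candidate with the highest volatility-scaled score."""
    scores = [volatility_scaled_score(r, lam) for r in candidate_trajectories]
    return max(range(len(scores)), key=scores.__getitem__)


# Example: a stable reward trajectory is preferred over one with a higher
# peak reward but large fluctuations across decoding steps.
print(select_candidate([[0.80, 0.78, 0.79], [0.95, 0.40, 0.90]]))  # -> 0
```

Under these assumed choices, the stable trajectory wins even though the volatile one contains the single highest reward, which mirrors the abstract's claim that correct reasoning tends to come with stable reward trajectories.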
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19906