Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search
Keywords: reasoning, uncertainty quantification, inference-time scaling, process reward models
TL;DR: We attempt to mitigate reward hacking in inference-time search by using uncertainty estimates to guide compute allocation, but find that search optimization exploits these estimates, causing them to become miscalibrated and degrade performance.
Abstract: Inference-time search has emerged as a powerful paradigm for scaling large language models' reasoning capabilities.
Standard approaches like beam search rely on process reward models (PRMs) for dense, step-by-step scoring to identify promising reasoning paths.
However, scaling these methods exposes a known failure mode: as compute budgets grow, search algorithms discover out-of-distribution states that are spuriously assigned high reward, decoupling the proxy reward from actual reasoning ability.
To address this, we propose Uncertainty-Aware Tree Search (UATS), which uses a process uncertainty model (PUM) to predict when PRM predictions are unreliable.
Unlike standard beam search with its fixed branching factor, UATS dynamically allocates compute by increasing the branching factor at high-uncertainty nodes, resolving ambiguity through additional exploration.
In our evaluation, while PUMs perform well on held-out in-distribution data, this does not translate to improved downstream search. On instruction-tuned models, UATS matches standard beam search, but on RLVR-trained models, it consistently degrades performance as inference-time compute grows.
This negative result indicates that the search-induced distribution shift responsible for poor PRM generalization also degrades process uncertainty models. We conclude that uncertainty-guided inference-time scaling requires uncertainty quantification methods that remain calibrated under search-induced distribution shift.
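As an illustration only (not the authors' implementation, which is not shown on this page), the uncertainty-gated branching described in the abstract can be sketched as a beam-search loop in which the per-node branching factor depends on a PUM estimate. The scorers `prm_score`, `pum_uncertainty`, and the `expand` step are hypothetical stand-ins for learned models:

```python
import heapq
import random

random.seed(0)

# Hypothetical stand-ins: in practice these would be learned models
# scoring a partial reasoning trace.
def prm_score(state):
    return random.random()          # proxy reward for a partial trace (PRM)

def pum_uncertainty(state):
    return random.random()          # estimated unreliability of the PRM score (PUM)

def expand(state, k):
    # Stub for sampling k candidate next steps from the policy model.
    return [state + (i,) for i in range(k)]

def uats_step(beam, base_branch=2, max_branch=6, threshold=0.5, beam_width=4):
    """One round of uncertainty-aware tree search: nodes whose PRM score
    is flagged as unreliable receive a larger branching factor."""
    candidates = []
    for state in beam:
        k = max_branch if pum_uncertainty(state) > threshold else base_branch
        candidates.extend(expand(state, k))
    # Keep the beam_width highest-scoring candidates, as in beam search.
    return heapq.nlargest(beam_width, candidates, key=prm_score)

beam = [()]                         # start from the empty trace
for _ in range(3):
    beam = uats_step(beam)
print(len(beam))                    # beam width is maintained across rounds
```

The sketch also makes the reported failure mode easy to state: if `pum_uncertainty` itself becomes miscalibrated on the out-of-distribution states that search visits, the extra branching is spent in exactly the wrong places.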
Submission Number: 30