The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

Published: 30 May 2026, Last Modified: 30 May 2026
Venue: SPIGM @ ICML
License: CC BY 4.0
Keywords: test-time scaling; inference-time reasoning; sequential Monte Carlo; value-guided decoding
TL;DR: APPS approximates power sampling with sequential Monte Carlo and future-value guidance, reallocating inference compute toward more promising reasoning paths.
Abstract: A recurring pattern in ``reasoning without training'' is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. A principled way to bias inference toward such modes is power sampling, i.e., sampling from $p_\theta(x)^\alpha$ with $\alpha>1$. Recent work makes power sampling practical by estimating a future-dependent correction factor $z_t$ via Monte Carlo rollouts, thereby replacing iterative Markov chain Monte Carlo with forward-looking estimation. In this paper, we reframe that correction factor as a future-value selection potential in a Sequential Monte Carlo (SMC) view of power sampling: $z_t$ plays the role of a critic-like quantity, but it can be estimated directly from the model by short-horizon rollouts, with no verifier and no training required. Building on this view, we introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for training-free reasoning that approximates the sequence-level power target with a bounded population of partial solutions. APPS propagates these hypotheses in parallel by proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries, so that finite compute is redistributed across competing prefixes rather than spent along a single unfolding path. This yields a transparent scaling knob in the particle count, predictable peak memory, and a compute pattern that avoids both iterative trajectory editing and dense candidate-wise rollout fan-out, while improving robustness to pivotal early decisions by keeping multiple hypotheses alive throughout decoding. We further study an amortized variant in which the rollout-based selection potential is replaced by a lightweight learned head trained offline from rollout supervision, enabling fast future-value guidance at inference time.
More broadly, our results add to a growing view that a nontrivial part of the gains often attributed to post-training may also be approached through more faithful power approximation at inference time.
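To make the mechanics described in the abstract concrete, the following is a minimal toy sketch of blockwise particle power sampling with rollout-based future-value selection. It is not the authors' implementation: the toy model `next_token_probs`, the function names `rollout_future_value` and `apps_sketch`, and all constants are hypothetical, and the auxiliary-particle-filter weight correction used here is a standard textbook choice that may differ from the paper's exact scheme. The sketch assumes the proposal is the base model itself, so each sampled token contributes an incremental weight $p_\theta(x_t \mid x_{<t})^{\alpha-1}$, and it uses the identity $\sum_{x_{>t}} p_\theta(x_{>t} \mid x_{\le t})^{\alpha} = \mathbb{E}_{x_{>t} \sim p_\theta}\left[p_\theta(x_{>t} \mid x_{\le t})^{\alpha-1}\right]$ to estimate the future value from short rollouts.

```python
import math
import random

ALPHA = 4.0   # power exponent; alpha > 1 sharpens the model distribution
VOCAB = 3     # toy vocabulary size (assumption for illustration)
LENGTH = 4    # toy sequence length (assumption for illustration)


def next_token_probs(prefix):
    """Hypothetical stand-in for an LLM's next-token distribution."""
    # Logits depend on the previous token so that early choices matter.
    last = prefix[-1] if prefix else 0
    logits = [0.5 * ((last + k) % VOCAB) for k in range(VOCAB)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout_future_value(prefix, n_rollouts, rng):
    """Monte Carlo estimate of z = E_{cont ~ p}[p(cont | prefix)^(alpha - 1)]."""
    remaining = LENGTH - len(prefix)
    if remaining == 0:
        return 1.0  # nothing left to generate, so the future value is 1
    total = 0.0
    for _ in range(n_rollouts):
        seq, logp = list(prefix), 0.0
        for _ in range(remaining):
            probs = next_token_probs(seq)
            tok = rng.choices(range(VOCAB), weights=probs)[0]
            logp += math.log(probs[tok])
            seq.append(tok)
        total += math.exp((ALPHA - 1.0) * logp)
    return total / n_rollouts


def apps_sketch(n_particles=16, block=2, n_rollouts=8, seed=0):
    """Toy blockwise particle sampler targeting p(x)^alpha."""
    rng = random.Random(seed)
    prefixes = [[] for _ in range(n_particles)]
    logw = [0.0] * n_particles
    for t in range(LENGTH):
        # Propagate: propose from the base model, reweight by p^(alpha - 1).
        for i in range(n_particles):
            probs = next_token_probs(prefixes[i])
            tok = rng.choices(range(VOCAB), weights=probs)[0]
            prefixes[i].append(tok)
            logw[i] += (ALPHA - 1.0) * math.log(probs[tok])
        # Resampling boundary: select by weight times estimated future value.
        if (t + 1) % block == 0 and t + 1 < LENGTH:
            z = [rollout_future_value(p, n_rollouts, rng) for p in prefixes]
            top = max(logw)
            sel = [math.exp(lw - top) * zi for lw, zi in zip(logw, z)]
            parents = rng.choices(range(n_particles), weights=sel, k=n_particles)
            prefixes = [list(prefixes[j]) for j in parents]
            # Dividing by z removes the look-ahead factor from later weights.
            logw = [-math.log(z[j]) for j in parents]
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    return prefixes, [wi / total for wi in w]


if __name__ == "__main__":
    particles, weights = apps_sketch()
    best = max(range(len(particles)), key=lambda i: weights[i])
    print("highest-weight particle:", particles[best])
```

In this sketch the particle count `n_particles` is the only scaling knob, and memory grows linearly with it, which mirrors the bounded-population property the abstract describes. Replacing `rollout_future_value` with a cheap learned predictor would correspond, loosely, to the amortized variant.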
Submission Number: 10