AURORA: Global GP-UCB Controllers for Test-Time Compute with Calibrated Verifiers ($\mu(x)+\beta\sigma(x)$ over Random Fourier Features)
Abstract: Deep learning systems increasingly rely on test-time compute, for example by sampling, voting, or iterating at inference to improve reliability. Existing controllers typically decide per item, which leaves two gaps: they lack a global view of difficulty across a dataset, and they provide uncertainty estimates that are informal and hard to audit.
We present AURORA, a global controller that frames test-time compute as a bandit-style allocation problem in a shared feature space. Concretely, we map problem instances into random Fourier features $\phi(x)$ built on TF–IDF and SVD embeddings. We fit a Bayesian ridge model on those features to obtain a closed-form posterior over the feature weights, and from it a predictive mean $\mu(x)$ and standard deviation $\sigma(x)$ for each instance. In practice we use the UCB index
$$
\mu(x)+\beta\sigma(x)
$$
to prioritize questions that the model predicts will benefit most from additional compute.
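The feature map underlying this index can be sketched as a standard random-Fourier-feature construction applied to dense instance embeddings. The kernel choice, bandwidth `gamma`, and dimensions below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def rff_features(X, D=2048, gamma=0.5, seed=0):
    """Map dense embeddings X of shape (n, d) to D random Fourier features
    approximating an RBF kernel exp(-gamma * ||x - x'||^2).
    Sketch only; kernel and hyperparameters are assumptions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Spectral samples for the RBF kernel exp(-gamma * ||delta||^2):
    # frequencies are drawn from N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Inner products of RFF vectors approximate the RBF kernel:
X = np.random.default_rng(1).normal(size=(5, 16))
Z = rff_features(X, D=2048, gamma=0.5)
approx = Z @ Z.T
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
```

The approximation error shrinks as $O(1/\sqrt{D})$, so `D` trades memory for kernel fidelity.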
We also make candidate answers auditable and robust. First, we run a skeptical refutation loop that generates critiques and attempts targeted repairs. Second, we apply maximum marginal relevance to preserve diversity among solutions. Third, we score and fuse chains using calibrated verifiers: step-wise procedural checks (PRM) and outcome-level checks (ORM) with temperature scaling. These stages give interpretable verifier scores and traceable routing decisions.
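Of these stages, the diversity step is standard maximum marginal relevance. A minimal greedy sketch follows; the trade-off weight `lam` and the similarity measure are assumptions, not the paper's exact choices:

```python
import numpy as np

def mmr_select(sims_to_query, pairwise_sims, k, lam=0.7):
    """Maximum marginal relevance: greedily pick k candidates that score
    high against the query (relevance) while penalizing similarity to
    candidates already selected (redundancy)."""
    n = len(sims_to_query)
    selected = [int(np.argmax(sims_to_query))]  # start from the most relevant
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        scores = [
            lam * sims_to_query[i]
            - (1.0 - lam) * max(pairwise_sims[i][j] for j in selected)
            for i in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

# Two near-duplicate relevant candidates (0, 1) and one dissimilar one (2):
rel = [0.9, 0.85, 0.3]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
picked = mmr_select(rel, sim, k=2, lam=0.5)
```

With `lam=0.5` the selector skips the near-duplicate and keeps the dissimilar candidate, which is the behavior the diversity stage needs.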
We sketch the theory connecting our design to GP-UCB. Bayesian ridge on RFF provides a closed-form posterior for the feature weights. The resulting predictive variance induces a GP-like exploration bonus, which yields a principled allocation rule similar to classical regret-minimizing bandits. For clarity, the predictive quantities follow the usual ridge-regression posterior form:
$$
\hat w = ( \Phi^\top \Phi + \lambda I)^{-1}\Phi^\top y,\qquad
\mu(x)=\phi(x)^\top \hat w,
$$
$$
\sigma^2(x)=\sigma_n^2\,\phi(x)^\top(\Phi^\top \Phi + \lambda I)^{-1}\phi(x) + \sigma_n^2,
$$
where $\lambda = \sigma_n^2/\sigma_w^2$ under a Gaussian prior $w \sim \mathcal{N}(0, \sigma_w^2 I)$ with observation-noise variance $\sigma_n^2$.
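These closed-form quantities are a few lines of numpy. The sketch below writes the noise-variance scaling on the predictive variance out explicitly; the values of `lam`, `sigma_n`, and `beta` are illustrative, not the paper's settings:

```python
import numpy as np

def ridge_posterior(Phi, y, lam=1.0, sigma_n=0.1):
    """Closed-form Bayesian ridge posterior over RFF weights.
    Returns predictive-mean and predictive-variance functions."""
    D = Phi.shape[1]
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(D))
    w_hat = A_inv @ Phi.T @ y          # ridge point estimate
    def mu(phi_x):
        return phi_x @ w_hat            # predictive mean
    def var(phi_x):
        # epistemic term plus irreducible observation noise
        return sigma_n**2 * (phi_x @ A_inv @ phi_x) + sigma_n**2
    return mu, var

def ucb_index(phi_x, mu, var, beta=2.0):
    """GP-UCB-style acquisition: mean plus beta posterior standard deviations."""
    return mu(phi_x) + beta * np.sqrt(var(phi_x))

# Toy check on an identity design matrix:
Phi = np.eye(3)
y = np.array([1.0, 2.0, 3.0])
mu, var = ridge_posterior(Phi, y, lam=1.0, sigma_n=0.1)
```

The controller then spends its next unit of compute on $\arg\max_x \mu(x) + \beta\sigma(x)$ over the unsolved items.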
Empirically, on GSM8K using the full test set, we emphasize diagnostic metrics rather than a single leaderboard score. We report reliability curves with expected calibration error near 0.11 for AURORA. We evaluate selective prediction using controller confidence versus self-consistency vote share as a baseline, and we analyze complementary wins and losses with McNemar tests. We also provide an anytime frontier built from partial logs, compute-sensitive metrics such as tokens-per-correct, and allocation histograms that show the controller concentrates compute on high-entropy, hard-tail cases. Ablations that remove refutation or replace the GP-like uncertainty model confirm that both components materially improve selective prediction and routing.
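The reliability-curve metric above (expected calibration error) can be computed from logged confidences with standard equal-width binning. This is a generic sketch, not necessarily the paper's exact binning scheme:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy|
    over equal-width confidence bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the left edge only for the first bin
        mask = ((conf >= lo) if lo == 0.0 else (conf > lo)) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# Toy example: every prediction is off by 0.1 in confidence.
ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

Lower is better; a perfectly calibrated controller would have per-bin confidence equal to per-bin accuracy and an ECE of zero.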
Takeaway. AURORA offers a principled, global approach to test-time compute. It turns uncertainty into actionable abstention and routing, makes compute allocation auditable, and grounds controller behavior in GP-UCB theory. Self-consistency remains a strong per-item baseline, but AURORA gives clear advantages when evaluators need calibrated confidence and traceable allocation.