Allocation, Not Volume: Test-Time Compute for Agentic Forecasting

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Forecast@ICML26 Oral
License: CC BY 4.0
Keywords: test-time compute, forecasting, LLM multi-agent systems, adaptive compute allocation
Abstract: Test-time compute scaling has been studied extensively in verifiable domains such as math and code; how to spend an inference budget for forecasting future events, where no test-time verifier exists, is far less studied. We compare three multi-agent compute-allocation policies, static depth (*Predictor-Critic*), static breadth (*Ensemble*), and adaptive routing (*Hierarchical Orchestrator*), on a contamination-controlled benchmark of $228$ label-balanced binary ForecastBench questions resolved strictly after every base model's knowledge cutoff. On $\texttt{gpt-5.4-mini}$, adaptive routing occupies the entire cost-accuracy Pareto frontier ($80.7\%$ accuracy at USD $0.18$ per question, vs. $78.1\%$ at USD $0.90$ for Ensemble and $76.8\%$ at USD $1.67$ for Predictor-Critic). The same Pareto ordering replicates on $\texttt{gpt-5.4-nano}$, where the Orchestrator is about $13\times$ cheaper than the top Ensemble with no statistically significant accuracy gap; the Orchestrator's cost-quality advantage further extends to Claude Sonnet 4.5, DeepSeek v4 Flash, and Gemini 2.5 Flash, so the result is not a single-model artefact. A two-stage diagnostic explains the win as *selective spending*: the Orchestrator concentrates compute on questions where its cheap direct baseline is uncertain.
Submission Number: 117
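
The Pareto claim in the abstract can be checked directly from the three reported $\texttt{gpt-5.4-mini}$ operating points. The snippet below is a minimal sketch, not the authors' evaluation code: the `Policy` type and the `dominates` and `pareto_frontier` helpers are hypothetical names introduced for illustration, and only the cost and accuracy figures come from the abstract. A policy is treated as dominated if another policy is no more expensive and no less accurate, and strictly better on at least one of the two.

```python
from typing import List, NamedTuple


class Policy(NamedTuple):
    """One operating point: cost per question (USD) and accuracy (%)."""
    name: str
    cost_usd_per_q: float
    accuracy_pct: float


def dominates(a: Policy, b: Policy) -> bool:
    """True if `a` is at least as cheap and as accurate as `b`, and strictly better on one axis."""
    no_worse = a.cost_usd_per_q <= b.cost_usd_per_q and a.accuracy_pct >= b.accuracy_pct
    strictly_better = a.cost_usd_per_q < b.cost_usd_per_q or a.accuracy_pct > b.accuracy_pct
    return no_worse and strictly_better


def pareto_frontier(policies: List[Policy]) -> List[Policy]:
    """Keep only the policies that no other policy dominates."""
    return [p for p in policies if not any(dominates(q, p) for q in policies)]


# Figures as reported in the abstract for gpt-5.4-mini.
POLICIES = [
    Policy("Hierarchical Orchestrator", 0.18, 80.7),
    Policy("Ensemble", 0.90, 78.1),
    Policy("Predictor-Critic", 1.67, 76.8),
]

frontier = pareto_frontier(POLICIES)
print([p.name for p in frontier])  # → ['Hierarchical Orchestrator']
```

With these three points the Orchestrator is both the cheapest and the most accurate, so it dominates the other two policies and is the only point left on the frontier. This check uses point estimates only and says nothing about the statistical significance of the accuracy gaps.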