Anytime Verified Agents: Adaptive Compute Allocation for Reliable LLM Reasoning under Budget Constraints
Abstract: Large language model (LLM) agents show promising results in reasoning, planning, and tool use, but their performance scales with the computational budget they are given. Existing methods allocate compute with static strategies such as fixed search depths, constant self-consistency sampling, or uniform verification, so simple problems consume as much compute as complex ones. We present Anytime Verified Agents (AVA), a framework that dynamically allocates compute across search, tool use, and verification within a user-specified budget. AVA integrates calibrated uncertainty estimation, value-of-information-guided search expansion, and selective verification cascades with early exits. The controller allocates compute based on predicted failure risk and marginal reliability gains, allowing the agent to achieve higher accuracy at a fixed budget or lower cost at a target reliability level. We evaluate AVA on mathematical reasoning (GSM8K), multi-hop question answering (HotpotQA), and code generation (HumanEval) benchmarks, comparing against fixed-depth search, self-consistency, and always-verify baselines. Adaptive allocation achieves a 20-40% cost reduction at equivalent reliability while maintaining accuracy, yielding clear Pareto improvements in the compute-reliability trade-off.
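To make the allocation loop concrete, below is a minimal, self-contained sketch of a budgeted anytime controller in the spirit of the abstract. All function names, the cost constants, and the confidence-update rule are hypothetical illustrations, not the paper's actual API or algorithm.

```python
# Minimal sketch of a budgeted anytime controller (illustrative, not AVA's code).
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    answer: str
    confidence: float  # calibrated probability that the answer is correct

def estimate_confidence(answer: str) -> float:
    """Placeholder for a calibrated uncertainty estimator."""
    return random.uniform(0.5, 0.99)

def expand_search(question: str) -> Candidate:
    """Placeholder for one search/sampling step (e.g., another CoT sample)."""
    answer = f"candidate-for-{question}"
    return Candidate(answer, estimate_confidence(answer))

def run_verifier(cand: Candidate) -> bool:
    """Placeholder for one stage of a verification cascade."""
    return cand.confidence > 0.8

def solve(question: str, budget: float, target_reliability: float = 0.9,
          search_cost: float = 1.0, verify_cost: float = 0.5) -> str:
    spent = 0.0
    best = expand_search(question)
    spent += search_cost
    while spent + min(search_cost, verify_cost) <= budget:
        if best.confidence >= target_reliability:
            break  # early exit: predicted failure risk already below target
        # Toy value-of-information scores: expected reliability gain per unit cost.
        voi_search = (1.0 - best.confidence) / search_cost
        voi_verify = (target_reliability - best.confidence) / verify_cost
        if voi_search >= voi_verify and spent + search_cost <= budget:
            cand = expand_search(question)
            spent += search_cost
            if cand.confidence > best.confidence:
                best = cand
        elif spent + verify_cost <= budget:
            spent += verify_cost
            if run_verifier(best):
                # A verifier pass raises calibrated confidence (toy update).
                best.confidence = min(1.0, best.confidence + 0.1)
            else:
                best.confidence *= 0.5  # failed check: distrust the answer
        else:
            break
    return best.answer
```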
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewer for their constructive feedback and positive assessment of our work. We have revised the manuscript to address all requested changes.
**1. Sensitivity Analysis (Requested Plot)**
We have added Figure 3 and expanded Table 5 to include an explicit Accuracy column. The figure shows Cost vs. Accuracy as controller thresholds vary across 35 configurations on GSM8K. As the table and figure demonstrate, accuracy remains within one percentage point (81.7%–82.4%) while cost varies by roughly 30% (630–830 tokens). The thresholds were derived via grid search on validation data; the analysis confirms the controller is robust to threshold perturbations.
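For readers reproducing the sweep, here is a hedged sketch of how such a threshold grid search could be set up; the specific grid, the 5x7 factorization into 35 configurations, and the toy surrogate for evaluation are our illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of the threshold sensitivity sweep behind Figure 3:
# grid-search controller thresholds on validation data, record (cost, accuracy).
import itertools

def evaluate(exit_threshold: float, verify_threshold: float):
    """Placeholder: run the controller on a validation split and return
    (mean token cost, accuracy). A toy surrogate stands in here."""
    cost = 600 + 250 * exit_threshold          # stricter exit => more compute
    accuracy = 0.817 + 0.007 * verify_threshold
    return cost, accuracy

configs = list(itertools.product(
    [0.80, 0.85, 0.90, 0.95, 0.99],            # early-exit confidence threshold
    [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99],     # verification trigger threshold
))  # 5 x 7 = 35 configurations, matching the count reported in Figure 3
results = [(c, *evaluate(*c)) for c in configs]
for (exit_t, verify_t), cost, acc in sorted(results, key=lambda r: r[1]):
    print(f"exit={exit_t:.2f} verify={verify_t:.2f} cost={cost:.0f} acc={acc:.3f}")
```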
**2. Calibration Transfer / Controller Failure**
Table 7 already reports transfer experiments (GSM8K→MATH, GSM8K→HotpotQA). We have added clarifying text to the Discussion section (Section 6): "The controller does not fail outright; allocation decisions remain reasonable because relative uncertainty ordering is partially preserved, though reliability at high confidence thresholds drops (52.4% actual accuracy at 90% reported confidence for GSM8K-to-MATH transfer, versus 82.5% in-domain)."
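The quoted reliability numbers correspond to a standard selective-accuracy check: among predictions whose reported confidence clears a threshold, measure the fraction that are actually correct. A minimal sketch with synthetic data (the function name and data are ours, not the paper's):

```python
# Reliability at a confidence threshold, computed on synthetic data.
def reliability_at_confidence(confidences, correct, threshold=0.90):
    """Fraction of correct answers among predictions at/above the threshold."""
    selected = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    return sum(selected) / len(selected) if selected else float("nan")

# The paper reports ~82.5% in-domain vs. ~52.4% under GSM8K->MATH transfer
# at 90% reported confidence; the toy inputs below just exercise the check.
acc = reliability_at_confidence([0.95, 0.92, 0.91, 0.97], [1, 1, 0, 1])
print(f"accuracy at >=90% confidence: {acc:.1%}")
```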
**3. Comparison with SelfBudgeter and Strategic Scaling**
Both works are already discussed in Related Work (Section 2) and included in Table 1. Lines 95–98 explain the key differences: SelfBudgeter adapts token count only and requires fine-tuning; Strategic Scaling uses bandit-based sample allocation only. AVA jointly controls multiple dimensions (search, sampling, verification, tool use) using calibrated uncertainty without requiring model fine-tuning. We also note that these approaches could be complementary.
Assigned Action Editor: ~Weiyang_Liu1
Submission Number: 6525