Anytime Verified Agents: Adaptive Compute Allocation for Reliable LLM Reasoning under Budget Constraints

Published: 19 Apr 2026, Last Modified: 19 Apr 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Large language model (LLM) agents can perform multi-step reasoning, planning, and tool use. However, their performance scales with the computational budget. Existing methods allocate computational resources using static strategies such as fixed search depths, constant self-consistency sampling, or uniform verification, so simple problems can consume as much compute as complex tasks. We present Anytime Verified Agents (AVA), a framework that dynamically allocates compute across search, sampling, and verification within a user-specified budget, with an extensible interface for tool use. AVA combines calibrated uncertainty estimation, value-of-information-guided search expansion, and selective verification cascades with early exits. The controller allocates compute based on uncertainty and estimated marginal reliability gains. AVA is evaluated on mathematical reasoning (GSM8K and MATH), multi-hop question answering (HotpotQA), and code generation (HumanEval), with two model backends (GPT-5 and GPT-4o), and compared to fixed-depth search, self-consistency, and always-verify baselines. Across these benchmarks, AVA reduces cost at matched reliability thresholds while maintaining comparable accuracy.
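To make the abstract's control loop concrete, here is a minimal, hypothetical sketch of a budget-aware controller in the spirit AVA describes: it draws samples, tracks agreement-based confidence, exits early when confident, and pays for verification only in the uncertain middle band where the marginal reliability gain plausibly justifies the cost. All names (`run_ava_style_controller`, `sampler`, `verifier`, the cost constants) are illustrative assumptions, not the paper's actual interface.

```python
def run_ava_style_controller(task, budget, conf_threshold=0.9,
                             sample_cost=1.0, verify_cost=2.0):
    """Toy budget-aware controller: sample candidate answers, track
    agreement-based confidence, and invoke a verifier only when the
    answer is uncertain and the remaining budget covers the cost."""
    spent = 0.0
    votes = {}
    while spent + sample_cost <= budget:
        answer = task["sampler"]()           # draw one candidate answer
        spent += sample_cost
        votes[answer] = votes.get(answer, 0) + 1
        best, count = max(votes.items(), key=lambda kv: kv[1])
        confidence = count / sum(votes.values())
        # Early exit: confident enough without paying for verification.
        if confidence >= conf_threshold and sum(votes.values()) >= 3:
            return best, spent
        # Selective verification: only in the uncertain band, and only
        # if the verification cost still fits in the budget.
        if 0.5 <= confidence < conf_threshold and spent + verify_cost <= budget:
            spent += verify_cost
            if task["verifier"](best):       # cheap check, e.g. a unit test
                return best, spent
    # Budget exhausted: return the current majority answer, if any.
    best = max(votes.items(), key=lambda kv: kv[1])[0] if votes else None
    return best, spent
```

A degenerate usage example: with a deterministic sampler, the controller exits after three agreeing samples instead of spending the full budget, which is the cost-saving behavior the abstract claims at matched reliability.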
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
Major revision addressing all reviewer feedback (pre-acceptance):
1. Added two simpler adaptive baselines (confidence-threshold exit, adaptive sampling rule) in Table 5 to justify AVA's complexity over lightweight heuristics.
2. Added a verifier-noise stress test (Table 12).
3. Reorganized Section 3 around the three-stage architecture.
4. Improved notation formality: notation table (Table 2), formal pseudocode (Algorithms 1-2), all variables use proper symbols.
5. Expanded the Discussion section and added a Broader Impact statement.
Camera-ready revision addressing the Action Editor's recommendation:
6. Added the MATH benchmark (Hendrycks et al., 2021) evaluated with GPT-4o, addressing the request for a more challenging benchmark than GSM8K. Results in Table 2 and evaluation details in Appendix D.
7. Added a cross-model evaluation discussion (GPT-5 and GPT-4o) in Section 6, characterizing AVA's advantage as regime-dependent.
8. De-anonymized and switched to the accepted TMLR template.
Code: https://github.com/llmsresearch/AVA
Assigned Action Editor: ~Weiyang_Liu1
Submission Number: 6525