CaveAgent: Transforming LLMs into Stateful Runtime Operators

TMLR Paper9267 Authors

28 May 2026 (modified: 01 Jun 2026) | Withdrawn by Authors | CC BY 4.0
Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift. We present CaveAgent, a framework that shifts LLM tool use from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." CaveAgent introduces a dual-stream architecture: a semantic stream for lightweight reasoning and a runtime stream backed by a persistent Python environment for stateful execution. Rather than treating the LLM's text context as the primary workspace, CaveAgent makes the persistent runtime the central workspace. Beyond using code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces Stateful Runtime Management: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code-based approaches that remain text-bound. CaveAgent further provides a runtime-integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injections. This persistence mechanism serves as a high-fidelity external memory that reduces context drift in multi-turn interactions and preserves processed data for downstream applications with less information loss. Evaluations on Tau$^2$-bench and the Berkeley Function Calling Leaderboard (BFCL) across six state-of-the-art LLMs demonstrate consistent improvements in 11 out of 12 settings, with gains of up to +13.5% in success rate on multi-turn retail tasks. On BFCL, the three open-source models we evaluate all reach 94.0-94.7% under CaveAgent, comparable to closed-source Claude Sonnet 4.5 (94.4%) and Gemini 3 Pro (94.3%) and exceeding GPT-5.1 (89.6%) under their native function-calling protocols. That the 30B Qwen3-Coder reaches 94.4% suggests that the function-calling protocol is a key performance bottleneck alongside model scale. Token efficiency studies show a 28.4% reduction in total token consumption and up to a 51% reduction on data-intensive tasks relative to the best baseline. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).
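The inject, manipulate, and retrieve cycle described in the abstract can be illustrated with a minimal sketch. This is not CaveAgent's API: the class `PersistentRuntime` and its methods `inject`, `execute`, and `retrieve` are hypothetical names chosen for illustration, and the sample order data is invented. The sketch only assumes that a runtime stream can be modeled as a Python namespace that outlives individual turns.

```python
class PersistentRuntime:
    """Hypothetical stand-in for a persistent runtime stream (not CaveAgent's API)."""

    def __init__(self):
        # One namespace shared by every turn; objects stored here persist.
        self._namespace = {}

    def inject(self, name, obj):
        # Place a live Python object into the runtime under a variable name.
        self._namespace[name] = obj

    def execute(self, code):
        # Run model-generated code against the shared namespace.
        exec(code, self._namespace)

    def retrieve(self, name):
        # Hand the actual object back to the host program, not a text summary.
        return self._namespace[name]


runtime = PersistentRuntime()

# Invented sample data standing in for a DataFrame or database result.
orders = [
    {"id": 1, "total": 40.0, "status": "open"},
    {"id": 2, "total": 15.5, "status": "shipped"},
    {"id": 3, "total": 22.0, "status": "open"},
]
runtime.inject("orders", orders)

# Turn 1: code that filters the injected object and stores the result.
runtime.execute("open_orders = [o for o in orders if o['status'] == 'open']")

# Turn 2: code that builds on turn 1's variable, with a loop in a single step.
runtime.execute(
    "open_total = 0.0\n"
    "for o in open_orders:\n"
    "    open_total += o['total']\n"
)

# The host retrieves the result and can check it programmatically.
open_total = runtime.retrieve("open_total")
```

In this sketch the second turn never re-serializes the order list into text; it refers to `open_orders` by name. The retrieved value can also be compared against an expected result in code, which is the kind of programmatically verifiable feedback the abstract mentions.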
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Tian_Li1
Submission Number: 9267