Keywords: agent benchmarks, interrupted-state recovery, task resumption, checkpoint sufficiency, handoff quality, agent reliability, benchmark construction, multi-domain evaluation, partial-state reasoning, recovery robustness
TL;DR: ContinuityBench evaluates whether LLM agents can reliably resume partially completed tasks after interruption. Across three benchmarks and two models, interruption drops average task success from 41.7% to 28.0%.
Abstract: LLM agents are increasingly deployed in long-running, user-facing settings where execution can be broken by user interventions, tool failures, and context-management constraints. Yet standard agent benchmarks mostly evaluate uninterrupted runs, leaving recovery behavior largely unmeasured. We introduce ContinuityBench, a benchmark-agnostic framework that turns step-based agent benchmarks into controlled tests of continuity under interruption. It runs an uninterrupted baseline, interrupts execution at controlled points, and resumes the same partially completed task from the live environment state, while preserving the source benchmark's native evaluator. The handoff signal given to the resuming agent varies across three fidelity levels: h0 (no prior context), h1 (a structured summary), and h2 (the summary plus the full action history). Instantiating ContinuityBench on tau-bench, AppWorld, and TerminalBench with GPT-5.1 and Gemini 3 Flash, we find that interruption drops average task success from 41.7% to 28.0%. The effect of handoff fidelity is non-monotonic: h1 outperforms h2 in 11 of 18 benchmark/model/interruption settings. Trace analysis reveals three distinct recovery failure modes: conversational frame drift, recovery churn, and over-steering from richer context. These results identify resumption as a measurable axis of agent reliability that aggregate task-success metrics miss, and show that more handoff context is not automatically better.
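The interrupt-and-resume protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Handoff`, `build_handoff`, `CounterEnv`, `toy_policy`, and `run_episode` are hypothetical names, the environment is a toy counter, and the summary is a placeholder string. It only shows how the three handoff levels differ and how the resumed run continues from the live environment state before the evaluator is applied.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Handoff:
    """Context passed to the resuming agent (hypothetical structure)."""
    level: str
    summary: Optional[str] = None
    history: List[str] = field(default_factory=list)


def build_handoff(level: str, actions: List[str],
                  summarize: Callable[[List[str]], str]) -> Handoff:
    """Build the handoff for one of the three fidelity levels."""
    if level == "h0":  # no prior context
        return Handoff(level)
    if level == "h1":  # structured summary only
        return Handoff(level, summary=summarize(actions))
    if level == "h2":  # summary plus full action history
        return Handoff(level, summary=summarize(actions), history=list(actions))
    raise ValueError(f"unknown handoff level: {level}")


class CounterEnv:
    """Toy stand-in for a benchmark environment (assumption, not a real benchmark)."""

    def __init__(self, target: int = 5) -> None:
        self.target = target
        self.count = 0

    def step(self, action: str) -> None:
        if action == "inc":
            self.count += 1

    def evaluate(self) -> bool:
        # Stands in for the source benchmark's native evaluator.
        return self.count == self.target


def toy_policy(env: CounterEnv, handoff: Optional[Handoff]) -> Optional[str]:
    """Toy agent: reads the live state, ignores the handoff, stops at the target."""
    return "inc" if env.count < env.target else None


def run_episode(env: CounterEnv, policy, level: str, interrupt_at: int,
                max_steps: int = 50) -> Tuple[bool, Handoff]:
    actions: List[str] = []
    # Phase 1: run until the controlled interruption point.
    for _ in range(interrupt_at):
        action = policy(env, None)
        if action is None:
            break
        env.step(action)
        actions.append(action)
    # Interruption: the agent's context is dropped; the environment state persists.
    handoff = build_handoff(level, actions,
                            summarize=lambda a: f"{len(a)} actions taken so far")
    # Phase 2: resume the same task from the live environment state.
    for _ in range(max_steps - len(actions)):
        action = policy(env, handoff)
        if action is None:
            break
        env.step(action)
        actions.append(action)
    return env.evaluate(), handoff


if __name__ == "__main__":
    for level in ("h0", "h1", "h2"):
        success, handoff = run_episode(CounterEnv(target=5), toy_policy, level, interrupt_at=2)
        print(level, success, handoff.summary, handoff.history)
```

Because the toy policy reads only the live state, every level succeeds here. In a real agent, the policy would condition on the handoff, which is where the differences between h0, h1, and h2 reported in the abstract would arise.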
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 47