Recovery-Bench: Evaluating Agentic Recovery from Mistakes

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: agents, recovery, long-horizon tasks, benchmark
TL;DR: Recovery-Bench is a challenging benchmark that evaluates agents' ability to recover from previous mistakes.
Abstract: We introduce Recovery-Bench, a novel evaluation benchmark designed to measure the ability of language model agents to recover from prior mistakes and corrupted context, a capability critical for long-horizon tasks. By initializing agents with environments and action histories derived from failed trajectories, we demonstrate that models' recovery performance differs markedly from their performance in fresh contexts. Notably, GPT-5, which underperforms on clean terminal tasks in our setting, improves significantly in recovery scenarios. Our results suggest that resilience to context pollution is a distinct dimension of model and agent performance, underscoring the importance of designing benchmarks that reflect realistic, messy deployment environments.
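To make the setup concrete, below is a minimal sketch (not the authors' released code) of the evaluation protocol the abstract describes: an agent is resumed from a failed trajectory's environment state and polluted action history, and its success is compared against a fresh-context run. All class and method names here (`FailedTrajectory`, `AgentContext`, `agent.act`, `env.step`, `env.restore`) are hypothetical illustrations of the assumed interfaces.

```python
# Sketch of the Recovery-Bench idea: recovery run vs. fresh run.
# Interfaces are assumptions, not the benchmark's actual API.
from dataclasses import dataclass, field


@dataclass
class FailedTrajectory:
    """Snapshot of a prior failed attempt: env state plus the action log."""
    env_state: dict            # e.g. terminal / filesystem state at failure
    action_history: list[str]  # commands the failed agent issued
    observations: list[str]    # outputs the failed agent saw


@dataclass
class AgentContext:
    """Message history handed to the model at episode start."""
    messages: list[dict] = field(default_factory=list)

    def seed_from(self, traj: FailedTrajectory) -> None:
        # Replay the failed run into the context, polluting it the way
        # a real long-horizon session would.
        for act, obs in zip(traj.action_history, traj.observations):
            self.messages.append({"role": "assistant", "content": act})
            self.messages.append({"role": "user", "content": obs})


def run_episode(agent, env, ctx: AgentContext, max_steps: int = 50) -> bool:
    """Roll the agent forward until the task is solved or steps run out."""
    for _ in range(max_steps):
        action = agent.act(ctx.messages)   # assumed agent interface
        obs, done = env.step(action)       # assumed environment interface
        ctx.messages.append({"role": "assistant", "content": action})
        ctx.messages.append({"role": "user", "content": obs})
        if done:
            return True
    return False


def recovery_vs_fresh(agent, make_env, traj: FailedTrajectory) -> dict:
    # Fresh run: clean environment, empty context.
    fresh_ok = run_episode(agent, make_env(), AgentContext())

    # Recovery run: environment restored to the failed state, context
    # pre-filled with the failed trajectory.
    env = make_env()
    env.restore(traj.env_state)            # assumed snapshot-restore hook
    ctx = AgentContext()
    ctx.seed_from(traj)
    recovery_ok = run_episode(agent, env, ctx)

    return {"fresh": fresh_ok, "recovery": recovery_ok}
```

Aggregating `fresh` and `recovery` success rates over many failed trajectories would yield the two scores the abstract contrasts; the gap between them is what the paper treats as resilience to context pollution.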
Submission Number: 89