The Agent's Marathon: Probing the Limits of Endurance in Long-Horizon Tasks

ICLR 2026 Conference Submission13939 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: natural language model, agent, benchmark

Abstract: Large Language Model (LLM) agents, augmented with diverse tools, have shown impressive progress in domains such as scientific discovery and enterprise automation. Yet they remain brittle in long-horizon tasks that require extended sequences of interactions, where performance often deteriorates rapidly. Existing benchmarks provide only partial coverage of this challenge: manual or crowdsourced tasks are too short, tool-use benchmarks emphasize breadth over depth, and web-based evaluations rely on emergent rather than controllable complexity. To fill this gap, we introduce TaskWeaver, a rule-based, controllable platform for generating benchmark tasks with precisely adjustable difficulty and horizon length. At its core, TaskWeaver abstracts all tool use as file-read operations. This design choice removes superficial API complexities, allowing us to directly probe an agent’s core ability to reason and integrate intermediate results over long, dependent sequences. We instantiate the framework across three domains: document understanding and navigation, multi-modal information integration, and executable code analysis. Each domain probes a complementary aspect of agentic reasoning, and together they form a unified benchmark, LORE (Long-horizon Reasoning Evaluation). Empirical results show that even for the strongest models we tested, performance degrades significantly as task length and per-step complexity increase. Specifically, their accuracy approaches zero on tasks exceeding 120 steps, and on more challenging variants, performance collapses in fewer than 15 steps. These findings highlight long-horizon robustness as a central open challenge for future agent development.
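To make the file-read abstraction concrete, the following is a minimal sketch of a task generator with an adjustable horizon. It is not the authors' implementation: the names `build_chain` and `solve_by_reading`, the `NEXT`/`ANSWER` file format, and the in-memory dictionary standing in for a file system are all assumptions made for illustration. Each file names the next file to read, so the number of dependent steps is set directly by `num_steps`.

```python
import random


def build_chain(num_steps, seed=0):
    """Build a hypothetical task: a chain of files, each pointing to the next.

    Every file except the last holds "NEXT <filename>"; the last holds
    "ANSWER <value>". The dict is a stand-in for a real file system.
    """
    rng = random.Random(seed)
    names = [f"file_{i:04d}.txt" for i in range(num_steps)]
    rng.shuffle(names)  # chain order cannot be guessed from the file names
    answer = str(rng.randint(1000, 9999))

    files = {}
    for current, following in zip(names, names[1:]):
        files[current] = f"NEXT {following}"
    files[names[-1]] = f"ANSWER {answer}"
    return files, names[0], answer


def solve_by_reading(files, start):
    """Reference solver whose only available action is a file read."""
    reads = 0
    current = start
    while True:
        content = files[current]  # the single "tool call": read one file
        reads += 1
        kind, value = content.split(" ", 1)
        if kind == "ANSWER":
            return value, reads
        current = value  # follow the pointer to the next file


files, start, answer = build_chain(num_steps=120, seed=42)
found, reads = solve_by_reading(files, start)
print(found == answer, reads)  # prints: True 120
```

In this sketch the horizon is exactly the number of reads needed to reach the answer. Per-step difficulty, which the abstract describes as separately adjustable, is not modeled here; one could imagine raising it by requiring a computation over each file's contents before the next file name is known.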

Primary Area: datasets and benchmarks

Submission Number: 13939
