Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
Keywords: Web Agents, Computer-Use Agents, LLM Agents, Long Horizon
Abstract: Existing web agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. However, real web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We test several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially operate productively for hours. We release our tasks and evaluation scripts at [removed for review].
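As a rough illustration of the "rubric score per step" idea, the sketch below computes a per-trajectory efficiency value. It rests on assumptions the abstract does not confirm: the rubric score is taken to be the fraction of rubric items an agent satisfies, efficiency is that score divided by the number of steps, and the result is reported as a percentage. The names Trajectory, rubric_score, and trajectory_efficiency are hypothetical and are not taken from the released evaluation scripts; the paper's exact definition may differ.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one agent run on one task."""
    rubric_results: list[bool]  # one pass/fail entry per rubric item (assumed binary)
    num_steps: int              # number of actions the agent took


def rubric_score(traj: Trajectory) -> float:
    """Fraction of rubric items satisfied (assumed scoring rule)."""
    if not traj.rubric_results:
        raise ValueError("trajectory has no rubric items")
    return sum(traj.rubric_results) / len(traj.rubric_results)


def trajectory_efficiency(traj: Trajectory) -> float:
    """Rubric score per step, as a fraction (multiply by 100 for a percentage)."""
    if traj.num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(traj) / traj.num_steps


# Illustrative values only, not results from the paper:
# 3 of 6 rubric items satisfied over 40 steps.
example = Trajectory(rubric_results=[True, True, True, False, False, False], num_steps=40)
example_efficiency_percent = 100 * trajectory_efficiency(example)
```

Under these assumptions, an agent that satisfies the same rubric items in fewer steps scores higher, which is the "succeed efficiently and not simply eventually" distinction the abstract draws.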
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23