A2Perf: Benchmarking Autonomous Agents End-to-End in Realistic Domains

Ikechukwu Uchendu; Jason Jabbour; Korneel Van den Berghe; Joel Runevic; Matthew Stewart; Jeffrey Jian Ma; Srivatsan Krishnan; Izzeddin Gur; Austin V Huang; Colton Bishop; Paige Bailey; Wenjie Jiang; Ebrahim Songhori; Sergio Guadarrama; Jie Tan; J K Terry; Aleksandra Faust; Vijay Janapa Reddi

Published: 09 Jun 2025, Last Modified: 14 Jul 2025 | CODEML@ICML25 | CC BY 4.0

Keywords: reinforcement learning, benchmarking, reliability, system resource usage, imitation learning

TL;DR: A unified evaluation framework for autonomous agents in real-world domains, providing standardized benchmarks for chip floorplanning, web navigation, and quadruped locomotion with metrics for performance, generalization, efficiency, and reliability.

Abstract: Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and on-device deployment, among other requirements. Several classes of methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, the community lacks benchmarking suites that define the environments, datasets, and metrics needed to meaningfully compare progress on applying these methods to real-world problems. We introduce A2Perf, a benchmarking suite including three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. In addition, we propose a data cost metric to account for the cost incurred in acquiring offline data for imitation learning, reinforcement learning, and hybrid algorithms, which allows us to better compare these approaches. As an open-source and extendable benchmark, A2Perf is designed to remain accessible, documented, up-to-date, and useful to the research community over the long term.

Submission Number: 52
