A2Perf: Real-World Autonomous Agents Benchmark

27 Sept 2024 (modified: 15 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: benchmark, reinforcement learning, autonomous agents, agents, benchmarking
TL;DR: Benchmark for holistically evaluating real-world autonomous agents, including challenging domains such as computer chip floorplanning, web navigation, and quadruped locomotion.
Abstract: Autonomous agents and systems span a range of application areas, from robotics and digital assistants to combinatorial optimization, and share common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must also generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and on-device deployment, among other requirements. Several major classes of methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is currently no benchmarking suite that defines the environments, datasets, and metrics needed to develop reference implementations and seed leaderboards with baselines, giving the community a meaningful way to compare progress. We introduce A2Perf, a benchmarking suite with three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, all of which are critical to real-world applications. In addition, we propose a data cost metric that accounts for the cost of acquiring offline data for imitation learning, reinforcement learning, and hybrid algorithms, allowing these approaches to be compared more fairly. A2Perf also contains baseline implementations of standard algorithms, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy. As an open-source and extensible benchmark, A2Perf is designed to remain accessible, documented, up-to-date, and useful to the research community over the long term.
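To make the metric categories in the abstract concrete, below is a minimal, hypothetical sketch of how a benchmark harness could place them side by side when comparing a pure reinforcement-learning baseline against an imitation-learning baseline. None of these names, fields, or formulas come from A2Perf's actual code; the data-cost term (number of demonstrations times an assumed per-episode acquisition cost) is purely an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    """Hypothetical per-run summary; field names are illustrative, not A2Perf's API."""
    mean_return: float              # task performance on in-distribution tasks
    ood_return: float               # generalization: performance on held-out tasks
    energy_kwh: float               # system resource efficiency during training
    return_std_across_seeds: float  # reliability: dispersion across random seeds
    num_demonstrations: int         # offline episodes used (0 for pure RL)
    cost_per_demonstration: float   # assumed cost of acquiring one offline episode


def total_data_cost(report: RunReport) -> float:
    """Illustrative data-cost term: offline-data acquisition cost for this run."""
    return report.num_demonstrations * report.cost_per_demonstration


def summarize(name: str, report: RunReport) -> None:
    print(
        f"{name}: return={report.mean_return:.1f}, "
        f"OOD return={report.ood_return:.1f}, "
        f"energy={report.energy_kwh:.1f} kWh, "
        f"seed std={report.return_std_across_seeds:.1f}, "
        f"data cost={total_data_cost(report):.0f}"
    )


if __name__ == "__main__":
    # Made-up numbers, only to show how the metric categories sit together.
    rl_run = RunReport(mean_return=95.0, ood_return=80.0, energy_kwh=120.0,
                       return_std_across_seeds=6.0,
                       num_demonstrations=0, cost_per_demonstration=0.0)
    il_run = RunReport(mean_return=90.0, ood_return=70.0, energy_kwh=15.0,
                       return_std_across_seeds=2.0,
                       num_demonstrations=5000, cost_per_demonstration=0.02)
    summarize("RL baseline", rl_run)
    summarize("IL baseline", il_run)
```

The point of the sketch is only that accounting for offline-data acquisition cost alongside compute and reliability lets otherwise incomparable training regimes be weighed on a common footing, which is the comparison the proposed data cost metric is meant to enable.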
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11791