Keywords: large language model, LLM-based agent, self-improvement, evaluation
TL;DR: We introduce AgentGym, an interactive framework with diverse scenarios for developing LLM-based agents. It also includes expanded instructions, trajectories, and a benchmark. We explore agent self-evolution across environments with the AgentEvol method.
Abstract: Large language models (LLMs), with their generalized capabilities, are considered a promising foundation for building generally-capable agents that can handle multi-turn decision-making tasks across various interactive environments. Previous attempts typically gather expert-provided trajectories and have LLM-based agents imitate these trajectories step by step. However, this supervised fine-tuning approach depends heavily on human supervision, limiting scalability and restricting the agent's exploration and learning in the environments. In this paper, we take the first step towards developing generally-capable LLM-based agents that can explore and evolve themselves across diverse environments. To achieve this, we identify a trinity of ingredients: 1) diverse interactive environments for agent exploration, 2) a trajectory set to equip agents with basic capabilities and prior knowledge, and 3) an effective and scalable approach for agent improvement across environments. We propose AgentGym, a new interactive framework featuring various real-world scenarios and environments for broad, unified, real-time, and concurrent agent exploration. AgentGym also includes a database with expanded instructions, high-quality trajectories, and a benchmark suite. Next, we investigate the potential of agent self-evolution across various environments with a derived exploration-learning method named AgentEvol. Experimental results show that the evolved agents can achieve results comparable to state-of-the-art (SOTA) models. We will release the code, dataset, benchmark, and checkpoints.
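The abstract describes an alternating exploration-learning cycle: agents explore diverse environments to collect trajectories, then learn from the rewarded ones. As a rough illustration of that idea (not the paper's actual algorithm or API; all class and function names here are hypothetical, and the tabular "agent" is a toy stand-in for an LLM policy), a minimal explore-then-learn loop might look like:

```python
# Hypothetical sketch of an explore-then-learn loop in the spirit of AgentEvol.
# All names (ToyEnv, Agent, evolve) are illustrative, not the paper's API.
import random


class ToyEnv:
    """A trivial environment: the agent is rewarded for one target action."""
    def __init__(self, target):
        self.target = target

    def reward(self, action):
        return 1.0 if action == self.target else 0.0


class Agent:
    """A toy tabular 'policy': preference weights over a fixed action set."""
    def __init__(self, actions):
        self.weights = {a: 1.0 for a in actions}

    def act(self, rng):
        actions = list(self.weights)
        total = sum(self.weights.values())
        probs = [self.weights[a] / total for a in actions]
        return rng.choices(actions, weights=probs)[0]

    def learn(self, trajectories):
        # Imitate successful trajectories via a reward-weighted update.
        for action, r in trajectories:
            self.weights[action] += r


def evolve(agent, envs, rounds=20, rollouts=16, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        # 1) Exploration: collect (action, reward) trajectories across environments.
        trajs = []
        for env in envs:
            for _ in range(rollouts):
                a = agent.act(rng)
                trajs.append((a, env.reward(a)))
        # 2) Learning: update only on rewarded trajectories.
        agent.learn([t for t in trajs if t[1] > 0])
    return agent


envs = [ToyEnv("search"), ToyEnv("search")]
agent = evolve(Agent(["search", "click", "type"]), envs)
best = max(agent.weights, key=agent.weights.get)
print(best)  # the rewarded action should dominate after evolution
```

In the paper's setting, the toy environments would be replaced by AgentGym's interactive scenarios and the tabular policy by an LLM fine-tuned on its own successful trajectories; only the alternation between exploration and learning carries over from this sketch.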
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11260