A Benchmark for Self-Evolving Agents via Experience-Driven Lifelong Learning

Yuxuan Cai; Yipeng Hao; Jie Zhou; Hang Yan; Zhikai Lei; Rui Zheng; Zhenhua Han; Yutao Yang; Junsong Li; Qianjun Pan; Tianyu Huai; Qin Chen; Kai Chen; Bo Zhang; Xipeng Qiu; Liang He

13 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Experience-Driven Lifelong Learning, Self-Evolving Agent, Skill Learning, Long-Term Memory, Self-Motivation, Continual Learning

Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to open-ended agents that learn continuously and adapt autonomously from experience. This vision treats long-term memory, self-driven exploration, persistent experience retention, and the internalization of knowledge into intuitive behavior as key to enabling self-evolving agents through experience-driven lifelong learning (ELL). In this paper, we introduce StuLife, a novel benchmark designed to evaluate whether current models exhibit these foundational capabilities of ELL. Specifically, StuLife simulates a student's holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around four key paradigm shifts: From Simulation to Reality, From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables (e.g., resource availability and time). Critically, agents are also expected to demonstrate intrinsic motivation by setting their own goals and initiating actions without external prompting. To this end, StuLife provides a comprehensive evaluation platform with novel metrics (e.g., StuGPA) that specifically assess these capabilities. Our evaluation shows that even the best model, GPT-5, scores only 17.9/100, which exposes a vast gap toward AGI and fundamental deficiencies in retaining long-term memory and acting with self-motivated initiative. Beyond evaluating state-of-the-art LLMs on StuLife, we also explore the role of context engineering in advancing AGI. Our results suggest that optimizing how we guide models may be as crucial as improving the models themselves, positioning context engineering as a key enabler of progress toward AGI.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 4678
