Keywords: LLM Agent, Memory, Continual Learning
Abstract: Effective long-term memory for LLM-based agents encompasses two fundamentally distinct capabilities: **system memory** (a.k.a. experiential memory), which distills reusable procedural knowledge from task execution, and **personal memory** (a.k.a. factual memory), which retains user-specific facts and preferences across sessions. In real-world deployments, these two memory types co-evolve with blurred boundaries, yet existing methods and benchmarks treat them in isolation; this assumption breaks down the moment an agent must simultaneously execute tasks and serve a persistent user. We introduce **AgentMemoryBench**, the first benchmark to jointly evaluate system and personal memory under a unified continual-learning framework. It spans six datasets across four environment types and standardizes five complementary evaluation modes (offline, online, replay, transfer, and repair) to measure improvement, retention, forgetting, generalization, and knowledge-conflict resolution over time. Building on this benchmark, we propose **MEMs**, a multi-memory coordination framework that maintains separate specialized stores and employs a lightweight trigger model as a meta-cognitive router to selectively retrieve from and update each store. Experiments reveal that single-memory and in-context-learning designs suffer systematic performance collapse in mixed-task regimes due to memory contamination and architectural mismatch, whereas MEMs maintains stable learning across task boundaries. AgentMemoryBench establishes a reproducible evaluation loop and offers practical guidance for building memory systems that remain robust in real-world deployments.
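To make the coordination idea concrete, the sketch below illustrates the general pattern of routing over separate specialized stores. It is a minimal, hypothetical illustration, not the authors' implementation: the `MemoryStore` class, the `trigger_route` keyword heuristic, and all example entries are stand-ins, and the paper's trigger model is a learned lightweight component rather than these hand-written rules.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """One specialized store; retrieval here is naive keyword overlap (toy stand-in)."""
    name: str
    entries: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance score: rank entries by number of tokens shared with the query.
        q = set(query.lower().split())
        ranked = sorted(self.entries, key=lambda e: -len(q & set(e.lower().split())))
        return ranked[:k]

    def update(self, entry: str) -> None:
        self.entries.append(entry)

def trigger_route(query: str) -> set[str]:
    """Stand-in for the trigger model acting as a meta-cognitive router:
    decide which store(s) a query should read from / write to.
    A real router would be a small learned classifier, not keyword rules."""
    words = set(query.lower().split())
    stores = set()
    if words & {"how", "steps", "tool", "execute"}:
        stores.add("system")    # procedural / experiential knowledge
    if words & {"i", "my", "prefer", "remember"}:
        stores.add("personal")  # user-specific facts and preferences
    return stores or {"system", "personal"}  # fall back to consulting both

# Separate stores kept isolated, so task traces cannot contaminate user facts.
memories = {"system": MemoryStore("system"), "personal": MemoryStore("personal")}
memories["personal"].update("User prefers concise answers in metric units.")
memories["system"].update("To file a ticket: open portal, select category, attach logs.")

query = "How do I execute the ticket-filing steps?"
for name in trigger_route(query):
    print(f"[{name}] -> {memories[name].retrieve(query)}")
```

The design point the sketch captures is that selective routing, rather than a single shared store, is what prevents the memory contamination the abstract identifies in mixed-task regimes.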
**Source code**: [https://github.com/s010m00n/AgentMemoryBench](https://github.com/s010m00n/AgentMemoryBench)
Submission Number: 208