Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Published: 10 Jun 2025, Last Modified: 27 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: LLM Agents; Agents with Memory; Memory Agents Benchmark; Long Context LLM
TL;DR: We introduce a unified evaluation framework designed for memory agents.
Abstract: Recent benchmarks for Large Language Model (LLM) agents have primarily focused on evaluating planning and execution capabilities, while another critical component, memory (how agents store, retrieve, and update long-term information), has received far fewer dedicated benchmarks. We refer to agents equipped with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that accumulate information incrementally. Moreover, no existing benchmark covers all four competencies. We therefore introduce MemAE (Memory Agent Evaluation), a unified evaluation framework designed specifically for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the four identified memory competencies and providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
Submission Number: 4