AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Published: 03 Mar 2026, Last Modified: 12 Mar 2026 · ICLR 2026 Workshop MemAgents · CC BY 4.0
Keywords: LLM Agents, Agent memory, Benchmark
Abstract: Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where long-horizon memory is critical for strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks focus primarily on dialogue-centric, human–agent interactions. In reality, agent memory consists of a continuous stream of agent–environment interactions composed primarily of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length) to evaluate long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lose causal and objective information and are constrained by the lossy nature of the similarity-based retrieval many of them employ. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory-system baselines by 11.16%.
Submission Number: 24