AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao; Boqin Yuan; Junbo Huang; Haocheng Yuan; Zhongming Yu; Haozhou Xu; Lanxiang Hu; Abhilash Shankarampeta; Zimeng Huang; Wentao Ni; Yuandong Tian; Jishen Zhao

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

Readers: Everyone

License: CC BY 4.0

Abstract: Large Language Models (LLMs) are increasingly used as autonomous agents in complex, long-horizon applications, where effective memory is critical for sustained performance. Yet existing memory benchmarks are largely dialogue-centric, while real agent memory consists of continuous agent-environment interaction trajectories composed of states, actions, observations, and tool outputs. To address this gap, we introduce **AMA-Bench** (**A**gent **M**emory with **A**ny length), a benchmark for evaluating long-horizon memory in realistic agentic settings. AMA-Bench combines real-world agent trajectories from representative applications with expert-curated QA, as well as synthetic trajectories that scale to arbitrary horizons with rule-based QA. Our study shows that existing memory systems underperform because they fail to capture causal and objective information and rely heavily on lossy similarity-based retrieval. We further propose **AMA-Agent**, a memory system based on causality-graph construction and tool-augmented retrieval. AMA-Agent achieves **57.22%** accuracy on AMA-Bench, outperforming the strongest baseline by **11.16%**. Resources are available at: [https://ama-bench.github.io/](https://ama-bench.github.io/).

Lay Summary: Large language models are increasingly being used as agents that can take actions, use tools, and work through complex tasks over long periods of time. For these agents to be useful, they need memory: they must remember what happened earlier, why certain decisions were made, and how past actions affected later outcomes. However, most existing ways of testing agent memory focus mainly on conversations, while real AI agents often produce long streams of tool calls, code, files, logs, and other machine-generated records. We introduce AMA-Bench, a benchmark for testing whether AI agents can remember and use information from long, realistic task histories. AMA-Bench includes both real agent activity records with expert-written questions and synthetic records that can be made arbitrarily long with automatically checked answers. Using AMA-Bench, we find that current memory systems often miss important cause-and-effect relationships and rely too heavily on simple text matching. To address this, we propose AMA-Agent, a memory system that organizes past events into a cause-and-effect graph and uses tools to retrieve more useful information. AMA-Agent performs substantially better than existing methods, showing a promising direction for building AI agents with more reliable long-term memory.

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/AMA-Bench/AMA-Bench

Primary Area: Deep Learning->Large Language Models

Keywords: LLM Agents, Agent memory, Benchmark

Originally Submitted PDF: pdf

Submission Number: 6752
