CoMem: A Benchmark for Continual Memory and Dynamic Preference Evolution in Long-Context Agents

ACL ARR 2026 January Submission 2305 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Memory System, Benchmark, User Preferences
Abstract: The transition of Large Language Models (LLMs) from stateless engines to lifelong agents requires robust capabilities for tracking dynamic user preferences amidst temporal noise. However, most existing benchmarks focus predominantly on static retrieval fidelity, neglecting the longitudinal evolution of user states. To bridge this gap, we introduce CoMem, a benchmark designed to evaluate continual memory and dynamic preference evolution across sequential dialogue checkpoints. CoMem incorporates two distinct scenarios, user-assistant dyadic interactions and multi-party dialogues, across three context lengths (up to 128k tokens). We introduce a rigorous evaluation protocol that adapts metrics from continual learning, specifically Forgetting Measure and Forward Transfer, applied to sequential dialogue checkpoints. Evaluating diverse LLMs and memory architectures, our experiments yield four critical insights: (1) Native Context Dominance: native long-context attention mechanisms significantly outperform external retrieval systems, which tend to discard subtle evolutionary signals; (2) The Forgetting Trap: most systems suffer severe catastrophic forgetting and "Forward Transfer Saturation" as dialogue complexity increases, failing to update outdated beliefs; (3) The Oscillation Phenomenon: high average accuracy often masks underlying volatility, with agents inconsistently flipping between correct and incorrect answers across checkpoints; and (4) Reasoning Limits: while reasoning-enhanced models act as denoising filters for noisy retrieval, they encounter cognitive-load thresholds in ultra-long contexts. The CoMem data construction pipeline and evaluation toolkit provide essential resources for developing next-generation personalized agents.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, evaluation methodologies, evaluation, benchmarking, retrieval, retrieval-augmented generation
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 2305