Keywords: Memory Evaluation, Agent Memory, Efficiency, Personalization, Long-term Interaction, Benchmark
Abstract: Memory in deployed LLM agents operates under a streaming regime where evidence arrives incrementally, context is bounded, and every pipeline stage incurs recurring latency and token cost. Yet existing benchmarks evaluate memory statically, providing full history at once and reporting only end-to-end accuracy, making it difficult to (i) attribute failures and costs to specific pipeline stages, and (ii) verify that correct answers truly depend on the memory system rather than on in-context extraction or inference shortcuts. We propose a streaming evaluation framework that structures user–agent conversations into streaming episodes with explicit Evidence–Query dependencies and evaluates memory through a four-stage pipeline (Formation, Management, Retrieval, Application) with stage-level accuracy and efficiency metrics. We further show that simply converting existing static benchmarks into a streaming format is insufficient: retrieval and application accuracy can diverge substantially, indicating that some tasks remain solvable without faithfully retrieving the intended evidence. To address this leakage, we construct StreamMemBench, a natively streaming benchmark with per-episode evidence boundaries and evidence-linked distractors that ensure correct answers require the memory pipeline. Across five memory systems and three datasets, we find that formation accuracy saturates while efficiency differs by an order of magnitude, and that retrieval–application divergence serves as a reliable diagnostic signal for evaluation leakage.
Submission Number: 234