Keywords: memory system, hallucination, multi-turn human–AI dialogue, benchmark, operation-level evaluation, LLMs
Abstract: Memory systems are essential for enabling long-term learning and sustained interaction in AI systems such as LLMs and agents. However, memory storage and retrieval often suffer from hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations mainly rely on end-to-end question answering, making it difficult to identify which memory operation causes hallucinations. To address this, we propose the Hallucination in Memory Benchmark (HaluMem), the first operation-level benchmark for evaluating hallucinations in memory systems. HaluMem defines three tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human–AI interaction datasets, HaluMem-Medium and HaluMem-Long, containing about 15k memory points and 3.5k questions. The datasets feature long dialogues, with average lengths of 1.5k and 2.6k turns, respectively, and context sizes exceeding 1M tokens, enabling evaluation across varying context scales and task complexities. Experiments on HaluMem show that current memory systems tend to introduce and accumulate hallucinations during extraction and updating, which then propagate to question answering. These findings highlight the need for interpretable and constrained memory operation mechanisms to improve memory reliability. All resources are available at [GitHub](https://anonymous.4open.science/r/HaluMem-691C).
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, metrics, language resources
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 1932