Keywords: Conversational Memory, Memory-Driven Dialogue, Long-Term Dialogue Evaluation, Style Consistency, Benchmark Dataset
Abstract: Most existing benchmarks for evaluating the memory of large language models (LLMs) rely on explicit recall-style question answering, where memory is directly queried and answered. However, such recall-based settings diverge from real-world human–AI interaction, in which memory is rarely triggered explicitly and instead manifests implicitly by shaping dialogue generation.
We introduce NaturalMem, a natural-dialogue benchmark for evaluating memory-driven dialogue, in which memory influences responses and speaking style without explicit recall prompts. NaturalMem constructs multi-turn dialogues from fictional character prototypes, excluding identifiable names and show-specific entities. Each character is associated with personal facts, category information, and a target speaking style, and is evaluated across multiple dialogue sessions to assess fact retention and style consistency.
Dialogue data are created through an LLM-assisted, human-curated pipeline. Experiments show that state-of-the-art agents still struggle to retain personal facts and maintain a stable speaking style in memory-driven dialogue settings. NaturalMem provides a realistic and diagnostic framework for evaluating memory in LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4614