Keywords: Conversational Memory, Memory-Driven Dialogue, Long-Term Dialogue Evaluation, Style Consistency, Benchmark Dataset
Abstract: Most existing benchmarks for evaluating the memory of large language models (LLMs) rely on explicit recall-style question answering, where memory is directly queried and answered. However, such recall-based settings diverge from real-world human–AI interaction, in which memory is rarely triggered explicitly and instead manifests implicitly by shaping dialogue generation.
We introduce NaturalMem, a natural-dialogue benchmark for evaluating memory-driven dialogue, in which memory influences responses and speaking style without explicit recall prompts. NaturalMem constructs multi-turn dialogues from fictional character prototypes, excluding identifiable names and show-specific entities. Each character is associated with personal facts, category information, and a target speaking style, and is evaluated across multiple dialogue sessions to assess fact retention and style consistency.
Dialogue data are created through an LLM-assisted, human-curated pipeline. Experiments show that state-of-the-art agents still struggle to retain personal facts and maintain a stable speaking style in memory-driven dialogue settings. NaturalMem provides a realistic and diagnostic framework for evaluating memory in LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4614