Benchmarking Long-Term Memory with Continuous Dialogue Lifelogs

20 Sept 2025 (modified: 22 Dec 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: memory agent, lifelog, memory dataset
TL;DR: We build two benchmarks for evaluating real-world memory agents in lifelog scenarios.
Abstract: Real-world memory systems hold considerable promise, especially in continuous dialogue lifelog scenarios, where wearable devices with always-on microphones keep recording the surrounding dialogue. Existing benchmarks mostly focus on person-AI interactions or person-person conversations and neglect continuous dialogue lifelogs, which integrate multi-person interaction, causal and temporal event threads, and more. In this paper, we propose two benchmarks, \textbf{EgoMemBench} and \textbf{LifeMemBench}, built with a hierarchical life simulation framework. EgoMemBench is constructed bottom-up from EgoLife, a real-world lifelogging video dataset covering a seven-day period, while LifeMemBench is simulated by LLMs through top-down elaboration to generate year-long personal lifelogs. Based on the hierarchical data at different temporal granularities, we design an automatic question-answering construction pipeline that generates four types of high-quality question-answer pairs. For evaluation, we employ both online and offline modes, prioritizing the online mode because it better aligns with the continuous dialogue lifelog scenario. Experiments across four representative memory systems show that MemOS consistently outperforms the others, achieving overall accuracies of 67.59\% and 66.16\% on the two benchmarks. This highlights the value of fine-grained memory management and the effectiveness of our benchmarks. Moreover, we show that event-level semantic segmentation of continuous dialogues yields better results than naive chunking, pointing to more effective ways of structuring lifelog memories. In conclusion, we define the continuous dialogue lifelog scenario and position it as a potential cornerstone for next-generation terminal AI assistants.
Primary Area: datasets and benchmarks
Submission Number: 23525