Keywords: long-horizon memory, conversational agents, language models, canonical memory typing, dense retrieval, long-context QA, retrieval-augmented generation, temporal and multi-hop reasoning, LLM-as-a-judge, efficiency and reproducibility
TL;DR: ENGRAM is a simple typed-memory system with a single router and retriever that achieves SOTA on LoCoMo and outperforms the full-context baseline on LongMemEval.
Abstract: Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the capacity to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval, and operating-system–style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight state-of-the-art memory system that organizes conversation into three canonical memory types (episodic, semantic, and procedural) through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and embeddings and persisted in a database. At query time, the system retrieves top-k dense neighbors per type, merges results with simple set operations, and provides relevant evidence as context to the model. ENGRAM attains state-of-the-art results on the LoCoMo benchmark, a realistic multi-session conversational question-answering (QA) suite for long-horizon memory, and exceeds the full-context baseline by 15 absolute points on LongMemEval, an extended-horizon conversational benchmark, while using only ≈1% of the tokens. Our results suggest that careful memory typing and straightforward dense retrieval enable effective long-term memory management in language models, challenging the trend toward architectural complexity in this domain.
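The write and read path described in the abstract (route each turn to a memory type, store a typed record with an embedding, retrieve top-k dense neighbors per type, merge with set operations) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-based `route` function, the hashed bag-of-words `embed` function, the in-memory `MemoryStore`, and all names and parameters below are hypothetical stand-ins for the router, dense encoder, and database that the abstract mentions without specifying.

```python
import math
import zlib
from dataclasses import dataclass

MEMORY_TYPES = ("episodic", "semantic", "procedural")


@dataclass(frozen=True)
class MemoryRecord:
    memory_type: str   # one of MEMORY_TYPES
    text: str          # the stored user turn
    embedding: tuple   # L2-normalized vector


def embed(text, dim=256):
    """Toy stand-in for a dense encoder (assumption): hashed bag of words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return tuple(v / norm for v in vec)


def route(turn):
    """Hypothetical keyword router; a real router would likely be a model."""
    tokens = set(turn.lower().split())
    types = set()
    if tokens & {"yesterday", "today", "went", "visited"}:
        types.add("episodic")
    if tokens & {"always", "never", "whenever", "steps"}:
        types.add("procedural")
    return types or {"semantic"}  # default when no cue word matches


class MemoryStore:
    """In-memory stand-in for the database (assumption)."""

    def __init__(self):
        self.records = {t: [] for t in MEMORY_TYPES}

    def write(self, turn):
        vec = embed(turn)
        for memory_type in route(turn):
            self.records[memory_type].append(
                MemoryRecord(memory_type, turn, vec)
            )

    def retrieve(self, query, k=2):
        q = embed(query)
        merged = {}  # keyed by text, so the merge acts as a set union
        for memory_type in MEMORY_TYPES:
            scored = [
                (sum(a * b for a, b in zip(q, r.embedding)), r)
                for r in self.records[memory_type]
            ]
            scored.sort(key=lambda pair: pair[0], reverse=True)
            for score, record in scored[:k]:  # top-k neighbors per type
                merged.setdefault(record.text, (score, record))
        # Highest cosine similarity first; this list would form the context.
        return sorted(merged.values(), key=lambda pair: pair[0], reverse=True)
```

A usage example under the same assumptions: after calling `store.write(...)` on each user turn, `store.retrieve("My favorite tea is oolong", k=2)` returns at most `k` records per type, deduplicated by text, which would then be passed to the model as evidence.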
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 22904