Minerva: A Programmable Memory Test Benchmark for Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
Abstract: How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights, failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored search tasks (passkey, key-value, needle in the haystack), a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, and comparing information in context memory; performing basic operations when inputs are structured into distinct blocks; and maintaining state while operating on memory, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to perform more complex, integrated tasks. Our benchmark enables an interpretable, detailed assessment of the memory capabilities of LLMs.
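To make the idea of "programmable" test generation concrete, here is a minimal sketch, not the authors' released code, of how such tests could be produced: each generator returns a (context, question, answer) triple, and difficulty knobs such as the number of distractor pairs or state updates are parameters rather than a fixed dataset. All function and parameter names below are illustrative assumptions.

```python
import random
import string


def _token(rng, length=8):
    """Random alphanumeric token used as a key or value."""
    return "".join(rng.choices(string.ascii_lowercase + string.digits, k=length))


def key_value_search(num_pairs=100, seed=0):
    """Atomic search task: retrieve the value stored under a queried key."""
    rng = random.Random(seed)
    pairs = [(_token(rng), _token(rng)) for _ in range(num_pairs)]
    context = "\n".join(f"{k}: {v}" for k, v in pairs)
    key, value = rng.choice(pairs)
    question = f"What value is associated with the key '{key}'?"
    return context, question, value


def state_tracking(num_updates=50, seed=0):
    """Atomic state task: apply a sequence of updates to a variable and
    report its final value."""
    rng = random.Random(seed)
    updates, state = [], 0
    for _ in range(num_updates):
        delta = rng.randint(-9, 9)
        op = f"x = x + {delta}" if delta >= 0 else f"x = x - {-delta}"
        updates.append(op)
        state += delta
    context = "x = 0\n" + "\n".join(updates)
    question = "What is the final value of x?"
    return context, question, str(state)


if __name__ == "__main__":
    ctx, q, a = key_value_search(num_pairs=5, seed=42)
    print(ctx, q, f"expected: {a}", sep="\n")
```

Because every instance is generated from a seed and a task specification, a failure can be traced to a specific atomic capability (e.g., search vs. state maintenance), and composite tests can be assembled by chaining such generators; the actual benchmark's task set and generation pipeline are described in the paper.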
Lay Summary: How well can LLM-based AI assistants use their memory (or context) to complete different tasks? Many existing benchmarks are manually created and have several drawbacks: they're fixed, easy for models to overfit, hard to interpret, and don't reveal exactly what a model struggles with when it fails. In this paper, we introduce a framework for testing memory use in LLMs at scale. While most prior work focuses on simple search tasks, like finding a key piece of information in a long context (e.g., "needle in a haystack"), our framework goes further. We design a diverse set of task types and automatically generate many test examples for each one. These include fine-grained tasks like searching, recalling, editing, matching, and comparing information in context. We also test whether models can handle structured inputs and keep track of changing information (state). On top of that, we build composite tasks that combine multiple skills, allowing us to assess how well models handle more complex, integrated challenges. Our benchmark offers a detailed and interpretable way to understand the memory capabilities of LLMs.
Primary Area: General Machine Learning->Evaluation
Keywords: LLM evaluation, LLM capability, context utilization, memory benchmark
Submission Number: 12342