Testing Memory Capabilities in Large Language Models with the Sequence Order Recall Task

Published: 18 Oct 2024 · Last Modified: 16 Nov 2024 · lxai-neurips-24 · CC BY 4.0
Track: Short Paper
Abstract: Many benchmarks focus on evaluating Large Language Models (LLMs) on facts and semantic relations, primarily assessing their semantic memory. However, some memories in language are linked to their contexts, such as time and place, analogous to human episodic memory. To address this gap in evaluating memory in LLMs, we introduce the Sequence Order Recall Task (SORT). SORT requires LLMs to recall the correct order of text segments taken from a text excerpt. We present an initial evaluation dataset, Book-SORT, comprising 36,000 samples extracted from 9 books recently added to the public domain. When the relevant text is given to models in-context, we find that instruction-tuned LLMs can perform this task. However, when models must rely on memory stored in their weights, i.e., when they are not presented with the text excerpts, their accuracies drop below 60%, near or at chance levels. We hope that SORT will drive the development of memory-augmented LLMs.
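As a rough illustration of the task setup, a SORT sample can be framed as a two-alternative forced choice: two segments drawn from the same excerpt are shown in shuffled order, and the model must say which one appears first in the source. The sketch below assumes this format; the function name, segment length, gap, and prompt wording are illustrative assumptions, not the released Book-SORT pipeline.

```python
import random

def make_sort_sample(text: str, segment_len: int = 50, gap: int = 200) -> dict:
    """Build one illustrative SORT sample (assumed format, not the official
    Book-SORT code): two non-overlapping word segments from the same excerpt,
    presented in shuffled order."""
    words = text.split()
    # Pick two segment start positions with a fixed gap between them.
    start_a = random.randrange(0, len(words) - 2 * segment_len - gap)
    start_b = start_a + segment_len + gap
    seg_a = " ".join(words[start_a:start_a + segment_len])  # earlier in the text
    seg_b = " ".join(words[start_b:start_b + segment_len])  # later in the text
    # Shuffle presentation order; the model must recover the original order.
    first, second = (seg_a, seg_b) if random.random() < 0.5 else (seg_b, seg_a)
    label = "A" if first == seg_a else "B"
    prompt = (
        "Two segments from the same book are shown below.\n"
        f"Segment A: {first}\n"
        f"Segment B: {second}\n"
        "Which segment appears first in the book? Answer A or B."
    )
    return {"prompt": prompt, "answer": label}
```

With two answer options, guessing yields 50% accuracy, which is the chance level against which the reported sub-60% scores are compared.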
Submission Number: 24