NoLiMa: Long-Context Evaluation Beyond Literal Matching

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY-NC-SA 4.0
TL;DR: We introduce NoLiMa, a long-context benchmark that removes literal cues from needle-haystack tests, revealing that LLM performance degrades sharply with context length due to difficulty retrieving information without lexical overlap.
Abstract: Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
Lay Summary: Large language models (LLMs) are increasingly developed to handle very long documents — ranging from articles to entire books. A common way to evaluate this capability is to hide a key piece of information (a "needle") within a large amount of unrelated text (a "haystack") and then ask the model to retrieve it. However, in many existing tests, the needle closely resembles the question, allowing models to succeed through simple pattern matching rather than true understanding. To address this, we introduce NoLiMa, a new benchmark that makes the task more challenging by minimizing the overlap in wording between the question and the relevant information. This requires models to go beyond surface-level matching and instead identify deeper, more abstract connections. We tested 13 well-known language models that advertise support for long inputs. While they perform strongly on short texts, we observe a substantial decline in accuracy as the context length increases. Even state-of-the-art models like GPT-4o show significant drops when literal clues are removed. These findings highlight current limitations in long-context reasoning and point to the need for further research.
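To make the setup concrete, the sketch below shows how a needle-in-a-haystack test without literal overlap might be run. The helper names (build_haystack, query_model) and the needle/question pair are illustrative assumptions for this page, not the NoLiMa dataset or evaluation code; the actual implementation is in the repository linked below.

```python
# Minimal sketch of a NIAH-style test with no lexical overlap between question and needle.
# All names and the example needle/question here are hypothetical placeholders.

def build_haystack(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    pos = int(len(filler_paragraphs) * depth)
    return "\n\n".join(filler_paragraphs[:pos] + [needle] + filler_paragraphs[pos:])

def query_model(prompt: str) -> str:
    """Placeholder for a call to whichever long-context LLM is being evaluated."""
    raise NotImplementedError

# The question shares no content words with the needle; answering it requires the
# latent association that the Semper Opera House is located in Dresden.
needle = "Actually, Yuki lives next to the Semper Opera House."
question = "Which character has been to Dresden?"
expected_answer = "Yuki"

# Scale the filler to reach the target context length (e.g., 32K tokens).
filler = ["An unrelated distractor paragraph about everyday topics."] * 2000
haystack = build_haystack(filler, needle, depth=0.5)
prompt = f"{haystack}\n\nQuestion: {question}\nAnswer:"

# response = query_model(prompt)
# correct = expected_answer.lower() in response.lower()
```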
Link To Code: https://github.com/adobe-research/NoLiMa
Primary Area: Deep Learning->Large Language Models
Keywords: Long-context, Context length evaluation, Literal match, Lexical gap
Submission Number: 9398