NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities

Published: 16 Sept 2025, Last Modified: 16 Sept 2025, Accepted by TMLR, License: CC BY 4.0
Abstract: The capability of large language models to handle long-context information plays a crucial role across various real-world applications. Existing methods for evaluating long-context abilities often either rely on real-world long texts, making it difficult to exclude the influence of models' inherent knowledge, or introduce large amounts of irrelevant filler content to artificially reach target lengths, reducing the relevance and effectiveness of assessments. To address these limitations, we introduce NeedleBench, a comprehensive synthetic framework designed to assess retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths (e.g., 32k, 128k, and beyond). NeedleBench systematically embeds key data points at varying depths to rigorously test models' capabilities in diverse settings. Tasks within NeedleBench are categorized into two distinct scenarios: information-sparse, in which minimal relevant detail is embedded within extensive irrelevant text to simulate simpler real-world retrieval tasks; and information-dense, implemented as the Ancestral Trace Challenge, in which relevant information is continuously distributed throughout the context to simulate more complex real-world reasoning tasks. Our experiments show that, while recent reasoning models such as Deepseek-R1 and OpenAI's o3 demonstrate strong performance on mathematical reasoning benchmarks, they still struggle to generalize their reasoning abilities and perform poorly on our information-dense tasks, frequently encountering difficulties with continuous retrieval and reasoning even at relatively short context lengths. Furthermore, we identify and characterize a phenomenon termed 'under-thinking', wherein models prematurely conclude their reasoning processes despite the availability of relevant information. NeedleBench thus provides critical insights and targeted evaluation tools essential for understanding and improving the long-context capabilities of LLMs. All code and resources are publicly available at https://github.com/open-compass/opencompass.
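To illustrate the kind of "needle at a given depth" construction the abstract describes, here is a minimal sketch of how an information-sparse sample could be assembled. This is not the actual OpenCompass/NeedleBench implementation; the function and parameter names (build_sample, depth_ratio, target_words) are illustrative assumptions, and word count is used as a crude proxy for the token budget.

```python
# Minimal sketch (assumed, not the OpenCompass implementation) of building an
# information-sparse sample: one "needle" fact is inserted at a chosen depth
# inside irrelevant filler text, then paired with a retrieval question.

def build_sample(filler_sentences, needle, question,
                 depth_ratio=0.5, target_words=2000):
    """Trim filler to roughly `target_words` words (a rough stand-in for a
    token budget), insert `needle` at `depth_ratio` (0.0 = start of context,
    1.0 = end), and return the prompt shown to the model."""
    context, used = [], 0
    for sentence in filler_sentences:
        words = len(sentence.split())
        if used + words > target_words:
            break
        context.append(sentence)
        used += words
    insert_at = int(len(context) * depth_ratio)
    context.insert(insert_at, needle)
    haystack = " ".join(context)
    return f"{haystack}\n\nQuestion: {question}\nAnswer based only on the text above."


if __name__ == "__main__":
    filler = ["The weather report mentioned light rain over the weekend."] * 500
    needle = "The secret passcode for the vault is 7421."
    prompt = build_sample(filler, needle,
                          "What is the secret passcode for the vault?",
                          depth_ratio=0.25, target_words=1000)
    print(prompt[:200])
```

Sweeping depth_ratio and the length budget over a grid would reproduce the varying-depth, adaptive-length setup described above; the information-dense Ancestral Trace Challenge instead distributes many interdependent facts throughout the context rather than a single needle.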
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the Action Editor's final feedback, we have revised the 'Related Work' section. The 'Long-Context Benchmarks' paragraph now includes a more detailed discussion of concurrent benchmarks (e.g., Ruler, LongBench v2, MRCR) and better positions our work in relation to the existing literature.
Code: https://github.com/open-compass/opencompass
Assigned Action Editor: ~Manzil_Zaheer1
Submission Number: 4751