Keywords: Large Language Model, Natural Language Model, Natural Language Evaluation
Abstract: The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate the ability of large language models (LLMs) to handle long contexts by testing whether a model can locate query-relevant information amid a large volume of irrelevant text. This paradigm is increasingly treated as a standard method for quantifying the effective context length of LLMs. However, we find that context length measured in this manner does not faithfully reflect the genuine context-understanding capabilities of LLMs. Specifically, even advanced models such as GPT-4o struggle when the context contains only a few query-relevant sentences and no irrelevant distractors. To address this, we introduce NeedleChain, a new benchmark designed to measure the range of context lengths over which LLMs maintain intact understanding. Unlike NIAH, NeedleChain requires models to integrate and reason over the entire input to arrive at the correct answer. The benchmark is adaptable, allowing researchers to vary both context length and reasoning order for a more thorough analysis of long-context performance. Experiments with various state-of-the-art LLMs reveal a notable gap between their ability to process long inputs and their capacity to fully understand them, underscoring the need for benchmarks and methodologies beyond the NIAH paradigm. In addition, we propose a simple yet effective strategy, ROPE Contraction, which directly improves long-context reasoning without altering the model architecture. Throughout this paper, we argue that rather than rapidly extending context length, improving comprehension within a limited range may be more advantageous.
Primary Area: datasets and benchmarks
Submission Number: 2990