Keywords: Agentic-AI, LLM, Long-context, Needle-in-a-haystack
TL;DR: Smaller gold contexts significantly degrade LLM performance and amplify positional bias in long-context tasks, revealing a critical but overlooked factor in effective information aggregation.
Abstract: Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be retrieved from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of $\textit{gold context size}$, the length of the answer-containing document, has received little attention. We present the first systematic study of gold context size in long-context question answering, spanning three diverse benchmarks (general knowledge, biomedical reasoning, and mathematical reasoning), eleven state-of-the-art LLMs (including recent reasoning models), and more than 150K controlled runs. Our experiments reveal that LLM performance drops sharply as the gold context shrinks: $\textbf{smaller gold contexts consistently degrade model performance and amplify positional sensitivity}$, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of $\textit{varying lengths}$. This effect persists under rigorous confounder analysis: even after controlling for gold document position, answer token repetition, gold-to-distractor ratio, distractor volume, and domain specificity, gold context size remains a decisive, independent predictor of success. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
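To make the controlled design concrete, here is a minimal sketch of how such runs could be constructed. This is a hypothetical harness, not the paper's pipeline: the helper names (`build_haystack`, `truncate_to_tokens`) and the crude whitespace token counting are our assumptions. The key idea it illustrates is varying gold context size and gold document position independently over a fixed distractor pool.

```python
import random

def truncate_to_tokens(text: str, n_tokens: int) -> str:
    # Crude whitespace tokenization; a real study would use the
    # target model's own tokenizer to control length precisely.
    return " ".join(text.split()[:n_tokens])

def build_haystack(gold_doc: str, distractors: list[str],
                   gold_size: int, gold_position: int,
                   seed: int = 0) -> str:
    # One controlled run: truncate the gold document to `gold_size`
    # tokens and insert it at `gold_position` among shuffled distractors.
    rng = random.Random(seed)
    docs = list(distractors)
    rng.shuffle(docs)
    docs.insert(gold_position, truncate_to_tokens(gold_doc, gold_size))
    return "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))

# Sweep gold context size and position while holding the distractor
# pool fixed, so the two factors can be varied independently.
gold = "Paris is the capital of France. " * 200   # toy gold document
noise = [f"Irrelevant passage number {i}. " * 50 for i in range(10)]
for size in (32, 128, 512):
    for pos in (0, 5, 10):
        prompt = build_haystack(gold, noise, gold_size=size, gold_position=pos)
        # here: send `prompt` plus the question to each LLM and score the answer
```

Under this kind of setup, performance differences across the `size` axis at a fixed `pos` isolate the gold-context-size effect that the abstract reports, while differences across `pos` at a fixed `size` expose positional sensitivity.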
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21106