Keywords: long-context language models, retrieval-augmented generation, context engineering
Abstract: Existing "needle-in-a-haystack" (NIAH) benchmarks for long-context LLM evaluation often overlook "context engineering", using random distractors rather than the biased outputs of retrieval systems. We present HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network, which evaluates LLMs against ranked distractors from sparse, dense, hybrid, and graph-based retrievers. Experiments on 10 LLMs show substantial performance degradation as context length increases. We find that distractor composition is crucial: semantically similar distractors are more challenging than lexically similar ones. Graph-based reranking mitigates harmful distractors, improving LLM performance by up to 44%.
Submission Number: 132