Irrelevant Context Helps: Understanding the Impact of Context in Large Language Models

ICLR 2026 Conference Submission21132 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-turn, LLMs, Historical Context
TL;DR: We introduce When2Read, the first benchmark for studying the effects of question-irrelevant context on LLMs, showing it can both help and hurt performance, and propose output length as a simple heuristic for adaptive context use.
Abstract: Existing context management approaches assume question-irrelevant historical context is detrimental, overlooking the possibility that seemingly unrelated conversations may provide cognitive activation benefits similar to those observed in human problem-solving. We introduce contextual utility prediction to address this gap and present When2Read—a benchmark comprising 31,900 comparison pairs across eight domains with systematic manipulation of context dimensions, including task consistency, multi-turn depth, question count, and difficulty order. Evaluating eight state-of-the-art LLMs, we demonstrate that the impact of question-irrelevant historical context differs across tasks and settings: in some cases it improves model performance (by up to 18.6%), while in others it degrades performance (by as much as 7.3%). These bidirectional effects indicate that question-irrelevant context is neither consistently beneficial nor consistently harmful but highly situation-dependent, leaving no simple rule for when to use it. Inspired by insights from cognitive science, we use output length as a heuristic indicator: models that generate shorter responses than their no-context baseline perform poorly in over 90% of cases, offering a training-free criterion for context selection. Our findings challenge the assumption that question-irrelevant context universally harms performance and offer a practical solution for adaptive context management without model retraining. This work establishes a foundation for dialogue systems that strategically leverage diverse conversational histories.
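The output-length heuristic described in the abstract could be sketched roughly as follows. This is an illustrative assumption-based sketch, not the authors' implementation: the function name `should_keep_context`, the use of whitespace token counts, and the threshold of 1.0 are all hypothetical choices.

```python
def should_keep_context(response_with_context: str,
                        response_baseline: str,
                        ratio_threshold: float = 1.0) -> bool:
    """Training-free context-selection heuristic (hypothetical sketch).

    The paper reports that when a model's response under historical
    context is shorter than its no-context baseline response, the
    contextual run tends to perform poorly. Here we approximate
    "length" with a simple whitespace token count and keep the
    context only if the contextual response is at least as long as
    the baseline (threshold = 1.0 is an illustrative default).
    """
    len_ctx = len(response_with_context.split())
    len_base = len(response_baseline.split())
    if len_base == 0:
        # Degenerate baseline: nothing to compare against, keep context.
        return True
    return len_ctx >= ratio_threshold * len_base


# Example: a contextual answer shorter than the baseline is flagged,
# so the system would fall back to the no-context answer.
baseline = "The answer is 42 because the product of 6 and 7 equals 42."
with_ctx = "It is 42."
print(should_keep_context(with_ctx, baseline))  # → False
```

Under this sketch, a dialogue system would generate both responses (or estimate lengths cheaply), then serve the no-context answer whenever the heuristic returns `False`.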
Primary Area: datasets and benchmarks
Submission Number: 21132