When Long Contexts Break Logic: Separating Evidence Use and Decision Bias in Instruction-Tuned LLMs
Track: tiny / short paper (up to 4 pages)
Keywords: long context; logical reasoning
TL;DR: Diagnostics that separate evidence use from decision bias in long-context logical reasoning failures
Abstract: Large language models (LLMs) increasingly operate over long contexts, yet their logical reasoning remains brittle when many irrelevant tokens intervene between premises and query.
A recurring challenge is \emph{diagnosis}: when an LLM answers incorrectly in a long context, is the failure due to (i) not using the relevant premises, (ii) failing to compose them into a valid inference, or (iii) a biased decision rule at the final Yes/No readout?
We present a compact suite of probes that disentangle these failure modes using \emph{matched-prior subtraction}---a distractor-conditioned control prompt that preserves formatting and length while removing the content of the evidence.
Across three open instruction-tuned models (Qwen2.5, Llama-3.2, Gemma-2), we find on a ``needle-in-a-haystack'' variant of LogicBench that the evidence's influence on the final decision is near zero in early layers and rises sharply only in late layers.
For synthetic multi-premise rules (modus tollens, disjunctive syllogism, etc.), we show that many ``oracle'' failures under naive scoring are actually decision-level miscalibration: simple calibrated decision rules raise oracle accuracy to $0.83$--$0.93$ on several rules.
Finally, a \emph{local calibratability} analysis reveals that the required decision correction depends systematically on evidence placement (front/middle/end/interleaved), indicating multiple long-context bias regimes rather than a single global calibration.
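To make the matched-prior subtraction probe concrete, here is a minimal Python sketch, assuming a HuggingFace causal LM with single-token " Yes"/" No" readouts; the helper name, model checkpoint, and prompt strings are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of matched-prior subtraction (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed checkpoint; any instruction-tuned LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def yes_no_log_odds(prompt: str) -> float:
    """log p(Yes) - log p(No) from the next-token distribution at the prompt's end."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token logits at the final position
    logp = logits.log_softmax(-1)
    # assumes " Yes" / " No" each tokenize to a single leading token
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (logp[yes_id] - logp[no_id]).item()

# Evidence prompt: premises embedded among distractors. Control prompt: the same
# distractors, length, and formatting, with the evidence content replaced by
# matched neutral filler (hypothetical example strings).
evidence_prompt = "...distractors... Premise: If it rains, the street is wet. It rains. ...distractors... Q: Is the street wet? A:"
control_prompt = "...distractors... Premise: [length-matched neutral filler] ...distractors... Q: Is the street wet? A:"

# Matched-prior subtraction: the decision signal attributable to the evidence itself,
# with the formatting/length-induced prior cancelled by the control.
evidence_effect = yes_no_log_odds(evidence_prompt) - yes_no_log_odds(control_prompt)
```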
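The calibrated decision rules and the placement-conditioned (``local calibratability'') analysis can likewise be sketched as a threshold fit on bias-corrected log-odds. The accuracy-maximizing threshold search and the `scores_by_cond`/`labels_by_cond` containers below are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of a calibrated Yes/No decision rule over log-odds scores
# (e.g., outputs of yes_no_log_odds above) with binary labels.
import numpy as np

def fit_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Pick the log-odds cutoff maximizing accuracy on held-out calibration data."""
    candidates = np.sort(scores)
    accs = [((scores >= t) == labels).mean() for t in candidates]
    return float(candidates[int(np.argmax(accs))])

def decide(score: float, threshold: float) -> bool:
    """Answer Yes iff the (bias-corrected) log-odds clears the fitted threshold."""
    return score >= threshold

# Per-placement calibration, probing whether one global correction suffices or the
# bias regime shifts with evidence placement (scores_by_cond/labels_by_cond are
# hypothetical dicts of calibration arrays keyed by placement condition).
thresholds = {
    cond: fit_threshold(scores_by_cond[cond], labels_by_cond[cond])
    for cond in ["front", "middle", "end", "interleaved"]
}
```

If the fitted thresholds differ systematically across placement conditions, that is the signature of multiple long-context bias regimes rather than a single global miscalibration.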
Presenter: Pravish Sainath
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 153