Why Input-Level and Output-Level Interventions Are Insufficient for Enforcing Consistency in Large Language Models: A Negative Result

ACL ARR 2026 January Submission 2177 Authors

02 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: large language models, consistency enforcement, attention masking, logit biasing, sycophancy, hallucination, adversarial sycophancy, natural language inference, context-aware decoding, inference-time interventions, transformer architectures, negative results
Abstract: Large language models (LLMs) frequently generate outputs that contradict previously established facts---a phenomenon, related to \textit{sycophancy} and \textit{hallucination}, that undermines reliability in knowledge-intensive applications. While prior approaches require expensive retraining or introduce significant inference overhead, two natural inference-time interventions appear promising: (1) \textbf{input-level} interventions (attention masking), which control what the model attends to, and (2) \textbf{output-level} interventions (logit biasing), which directly constrain generation. Through theoretical analysis and empirical evaluation across five language models and 50 adversarial test cases, we demonstrate that \textbf{neither approach succeeds}, but for fundamentally different reasons. Attention masking achieves 0\% improvement due to an architectural gap---the output distribution depends on final hidden states, not attention patterns, making the intervention theoretically unsound. Logit biasing, while theoretically sound, fails due to catastrophic NLI detection failure (a 2\% contradiction detection rate), revealing a critical benchmark-reality gap. Our negative results provide a taxonomy of failure modes that saves community effort on unproductive directions and points toward Context-Aware Decoding as a more promising alternative that bypasses both limitations.
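To make the output-level intervention concrete, the following is a minimal sketch of logit biasing at decode time: flagged token ids have their logits shifted before softmax sampling, suppressing tokens judged to contradict established context. All names here (`apply_logit_bias`, the toy vocabulary, the bias value) are illustrative assumptions, not the submission's actual implementation.

```python
import math

def apply_logit_bias(logits, penalized_token_ids, bias=-10.0):
    """Shift the logits of flagged tokens; a negative bias suppresses them."""
    biased = list(logits)
    for tok in penalized_token_ids:
        biased[tok] += bias
    return biased

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary; suppose token 2 contradicts the established context.
logits = [1.0, 0.5, 3.0, 0.2]
probs_before = softmax(logits)
probs_after = softmax(apply_logit_bias(logits, {2}))
```

Note that this step is only as good as the detector that supplies `penalized_token_ids`; the abstract's point is precisely that an NLI-based detector catches contradictions far too rarely (2\%) for the bias to help in practice.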
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Contribution Types: Model analysis & interpretability
Languages Studied: Python
Submission Number: 2177