HALLUCINATION AS MISCLASSIFICATION: A COMPOSITE ABSTENTION ARCHITECTURE FOR LANGUAGE MODEL OUTPUT CONTROL

Published: 05 Mar 2026, Last Modified: 09 Mar 2026, ICLR 2026 Workshop LLM Reasoning, CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Hallucination; Abstention; Selective Prediction; Uncertainty Estimation; Output Gating; LLM Reliability; Self-Consistency; Black-Box Evaluation; Refusal Behavior; Safety Controls
TL;DR: A simple, black-box abstention gate combined with instruction-based refusal sharply reduces hallucinations by blocking unsupported LLM outputs under low evidence or conflicting context.
Abstract: Large language models routinely produce unsupported claims — a failure termed hallucination. We propose a control-theoretic framing: hallucination is a misclassification error at the output boundary, where internally generated completions are emitted as if grounded in evidence. This framing motivates a composite intervention combining instruction-based refusal with a structural abstention gate. The gate computes a support deficit score S_t from three black-box signals — self-consistency (A_t), paraphrase stability (P_t), and citation coverage (C_t) — and blocks output when S_t exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models (GPT-4o-mini, GPT-4o, GPT-3.5-turbo), neither mechanism alone was sufficient: instruction-only prompting reduced hallucination sharply, but exhibited over-cautious abstention on 10% of answerable items for GPT-4o-mini and GPT-4o, and residual hallucination for GPT-3.5-turbo (6% overall; driven primarily by conflicting-evidence items). The structural gate preserved 100% answerable accuracy across models but missed confident confabulation on conflicting-evidence items (70% hallucination for GPT-4o-mini and GPT-4o). The composite architecture achieved 96–98% overall accuracy with 0–4% hallucination, while inheriting the instruction component's 10% abstention on answerable items for GPT-4o-mini and GPT-4o. A supplementary 100-item no-context stress test derived from TruthfulQA confirmed that structural gating provides a capability-independent abstention floor: instruction-only abstention degraded to 62% for GPT-3.5-turbo, whereas the gate and composite conditions enforced 98–100% abstention across all models. These results are consistent across the tested autoregressive models, though architecture-level generality remains to be established.
Overall, instruction-based refusal and structural gating exhibit complementary failure modes — instruction can over-abstain on answerable items, while the gate can miss confident confabulation under conflicting evidence — suggesting that effective hallucination control benefits from combining both mechanisms.
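The abstention gate described above can be sketched as follows. The abstract does not specify how the three signals are combined into the support deficit score S_t, so this is a minimal sketch assuming each signal is normalized to [0, 1] and S_t is one minus their unweighted mean; the threshold value `tau` is likewise a placeholder, not the paper's calibrated setting.

```python
def support_deficit(a_t: float, p_t: float, c_t: float) -> float:
    """Support deficit S_t from self-consistency (A_t), paraphrase
    stability (P_t), and citation coverage (C_t), each assumed in [0, 1].
    Higher values indicate weaker evidential support for the completion.
    NOTE: the unweighted mean is an illustrative assumption, not the
    paper's stated formula."""
    return 1.0 - (a_t + p_t + c_t) / 3.0


def gate(a_t: float, p_t: float, c_t: float, tau: float = 0.5) -> str:
    """Block the model's output ('abstain') when the support deficit
    exceeds the threshold tau; otherwise pass it through ('emit')."""
    return "abstain" if support_deficit(a_t, p_t, c_t) > tau else "emit"
```

For example, a completion with high self-consistency, stable paraphrases, and full citation coverage passes through, while one with weak signals across the board is blocked.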
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 129