HALLUCINATION AS MISCLASSIFICATION: A COMPOSITE ABSTENTION ARCHITECTURE FOR LANGUAGE MODEL OUTPUT CONTROL
Track: long paper (up to 10 pages)
Keywords: Hallucination; Abstention; Selective Prediction; Uncertainty Estimation; Output Gating; LLM Reliability; Self-Consistency; Black-Box Evaluation; Refusal Behavior; Safety Controls
TL;DR: A simple, black-box abstention gate combined with instruction-based refusal sharply reduces hallucinations by blocking unsupported LLM outputs under low evidence or conflicting context.
Abstract: Large language models routinely produce unsupported claims, a failure termed hallucination. We propose a control-theoretic framing: hallucination is a misclassification error at the output boundary, where internally generated completions are emitted as if grounded in evidence. This framing motivates a composite intervention combining instruction-based refusal with a structural abstention gate. The gate computes a support deficit score St from three black-box signals (self-consistency At, paraphrase stability Pt, and citation coverage Ct) and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models (GPT-4o-mini, GPT-4o, GPT-3.5-turbo), neither mechanism alone was sufficient: instruction-only prompting reduced hallucination sharply but exhibited over-cautious abstention on 10% of answerable items for GPT-4o-mini and GPT-4o, and residual hallucination for GPT-3.5-turbo (6% overall, driven primarily by conflicting-evidence items). The structural gate preserved 100% answerable accuracy across models but missed confident confabulation on conflicting-evidence items (70% hallucination for GPT-4o-mini and GPT-4o). The composite architecture achieved 96–98% overall accuracy with 0–4% hallucination, while inheriting the instruction component's 10% abstention on answerable items for GPT-4o-mini and GPT-4o. A supplementary 100-item no-context stress test derived from TruthfulQA confirmed that structural gating provides a capability-independent abstention floor: instruction-only abstention degraded to 62% for GPT-3.5-turbo, whereas the gate and composite conditions enforced 98–100% abstention across all models. These results are consistent across the tested autoregressive models, though architecture-level generality remains to be established. Overall, instruction-based refusal and structural gating exhibit complementary failure modes (instruction can over-abstain on answerable items, while the gate can miss confident confabulation under conflicting evidence), suggesting that effective hallucination control benefits from combining both mechanisms.
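The abstention gate described in the abstract can be sketched in a few lines. The abstract does not specify how the three signals combine into the support deficit St, so the sketch below assumes a simple weighted average of the signals' complements; the signal names (At, Pt, Ct), the equal weights, and the 0.5 threshold are illustrative assumptions, not the paper's actual parameters.

```python
# Illustrative sketch of a structural abstention gate (assumed formula).
# Each signal is in [0, 1], where 1 means fully supported:
#   a_t: self-consistency, p_t: paraphrase stability, c_t: citation coverage.

def support_deficit(a_t: float, p_t: float, c_t: float,
                    weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Assumed combination rule: weighted average of signal complements."""
    w_a, w_p, w_c = weights
    return w_a * (1 - a_t) + w_p * (1 - p_t) + w_c * (1 - c_t)

def gate(answer: str, a_t: float, p_t: float, c_t: float,
         threshold: float = 0.5) -> str:
    """Emit the answer only if the support deficit stays below the
    threshold; otherwise abstain (block the output)."""
    s_t = support_deficit(a_t, p_t, c_t)
    return answer if s_t < threshold else "[ABSTAIN: insufficient support]"

# A well-supported completion passes the gate:
print(gate("Paris", a_t=0.9, p_t=0.95, c_t=1.0))    # -> Paris
# A weakly supported completion is blocked:
print(gate("Atlantis", a_t=0.3, p_t=0.4, c_t=0.0))  # -> abstains
```

In the composite condition, this gate would run downstream of instruction-based refusal: the model is first prompted to refuse when unsure, and the gate then blocks any emitted answer whose black-box support signals are too weak.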
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 129