Detection without Expression: A Geometric perspective of Language Model Hallucination

TMLR Paper9218 Authors

26 May 2026 (modified: 11 Jun 2026), Under review for TMLR, CC BY 4.0

Abstract: Language models often respond fluently and confidently to questions for which the appropriate response would be to abstain. We study cases where the prompt is underspecified, has a false premise, or is outside the model's reliable knowledge. Such errors are usually treated as failures of factual access. We argue that they also reflect a failure of routing. A model may internally represent that an input should not be answered while failing to transform that representation into output behavior. Cross-entropy training creates prediction-aligned directions through which token commitments are expressed, because each example supplies a sharp gradient toward a vocabulary target. Answerability, however, is not given an equally stable target unless the training distribution explicitly rewards abstention. It can therefore be encoded as an input-aligned feature of the residual stream without becoming a prediction-aligned control variable. In this view, hallucination can be understood as a mismatch between the geometry that detects uncertainty and the geometry that expresses decisions. Across autoregressive transformer families, we find that factual and uncertain prompts are strongly separated in hidden states, while standard output-side uncertainty measures expose only a weak trace of this distinction. The answerability boundary is concentrated in the principal input geometry and only inconsistently aligned with the prediction geometry defined by the unembedding. Causal interventions confirm that this geometry is not merely diagnostic: routing the hidden answerability signal directly to refusal logits produces selective abstention, boundary steering produces large direction-dependent shifts in decoded responses, and linear projection onto the factual subspace does not repair uncertain states. 
These results suggest that reducing hallucination requires mechanisms that explicitly connect internal answerability representations to the output pathways where linguistic commitments are made.
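The distinction between a signal that is detectable in hidden states and one that is aligned with the output pathway can be illustrated with a toy NumPy sketch. This is not the authors' method or data. The synthetic hidden states, the mean-difference probe, the use of the top-`k` right singular vectors of an unembedding matrix `W_U` as a stand-in for "prediction geometry", and all function names are assumptions made for illustration only.

```python
import numpy as np

def mean_difference_direction(h_factual, h_uncertain):
    """Unit vector pointing from the factual mean to the uncertain mean."""
    diff = h_uncertain.mean(axis=0) - h_factual.mean(axis=0)
    return diff / np.linalg.norm(diff)

def probe_accuracy(h_factual, h_uncertain, direction):
    """Accuracy of a threshold classifier on the 1-D projection onto `direction`."""
    s_f = h_factual @ direction
    s_u = h_uncertain @ direction
    threshold = 0.5 * (s_f.mean() + s_u.mean())
    correct = np.sum(s_f < threshold) + np.sum(s_u >= threshold)
    return correct / (len(s_f) + len(s_u))

def prediction_alignment(direction, W_U, k):
    """Fraction of the squared norm of a unit `direction` that lies in the span
    of the top-k right singular vectors of W_U (shape: vocab x hidden).
    Assumption: this span is used here as a proxy for prediction geometry."""
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    top = Vt[:k]  # (k, hidden), orthonormal rows
    return float(np.sum((top @ direction) ** 2))

def demo(seed=0):
    """Synthetic example: the class signal lives in a hidden dimension that
    the toy unembedding cannot read."""
    rng = np.random.default_rng(seed)
    hidden, vocab, n = 8, 5, 200

    # Toy unembedding that only reads the first four hidden dimensions.
    W_U = np.zeros((vocab, hidden))
    W_U[:, :4] = rng.normal(size=(vocab, 4))

    # Two synthetic classes that differ only along the last hidden dimension.
    h_factual = 0.1 * rng.normal(size=(n, hidden))
    h_uncertain = 0.1 * rng.normal(size=(n, hidden))
    h_uncertain[:, 7] += 2.0

    direction = mean_difference_direction(h_factual, h_uncertain)
    acc = probe_accuracy(h_factual, h_uncertain, direction)
    align = prediction_alignment(direction, W_U, k=4)
    return acc, align

if __name__ == "__main__":
    acc, align = demo()
    print(f"probe accuracy: {acc:.3f}, alignment with prediction subspace: {align:.4f}")
```

By construction, the probe separates the two synthetic classes almost perfectly while the alignment score stays close to zero, so the logits carry essentially no trace of the distinction. In a real model the unembedding typically has full column rank, so the choice of `k` (or another definition of the prediction subspace) would have to be made deliberately.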

Submission Type: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=HtHycZg6Rk&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)

Changes Since Last Submission: We fixed the margin problem that caused the desk rejection. A \usepackage declaration was probably the cause.

Assigned Action Editor: ~Alberto_Bietti1

Submission Number: 9218
