Quantifying Genuine Awareness in Hallucination Prediction Beyond Question-Side Shortcuts

ACL ARR 2026 January Submission9221 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: LLM, hallucination, self-awareness
Abstract: Many works have proposed methodologies for language model (LM) hallucination detection and reported seemingly strong performance. However, we argue that the performance reported to date reflects not only a model’s genuine awareness of its internal information, but also awareness derived purely from question-side information (e.g., benchmark hacking). While benchmark hacking can be effective for boosting hallucination detection scores on existing benchmarks, it does not generalize to out-of-domain settings or practical usage. Nevertheless, disentangling how much of a model’s hallucination detection performance arises from question-side awareness is non-trivial. To address this, we propose Approximate Question-side Effect (AQE), a methodology for measuring this effect without requiring human labor. Our analysis using AQE reveals that existing hallucination detection methods rely heavily on benchmark hacking.
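The abstract does not specify how AQE is computed, so the following is only a minimal illustrative sketch of the underlying idea: compare a detector that sees only question-side features against one that also uses a model-internal signal, and treat the gap as a rough indication of how much performance is attributable to genuine awareness rather than question-side shortcuts. All feature names, the synthetic data, and the logistic-regression detectors below are hypothetical assumptions, not the authors' method.

```python
# Illustrative sketch only; AQE's actual formulation is not described in this abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical setup: question-only features (e.g., topic or ambiguity cues) and a
# model-internal confidence signal; labels mark whether the answer was a hallucination.
n = 2000
question_feats = rng.normal(size=(n, 8))
internal_conf = rng.normal(size=(n, 1))
labels = (0.6 * question_feats[:, 0] + 1.2 * internal_conf[:, 0]
          + rng.normal(scale=0.5, size=n) > 0).astype(int)

Xq_tr, Xq_te, Xfull_tr, Xfull_te, y_tr, y_te = train_test_split(
    question_feats, np.hstack([question_feats, internal_conf]), labels, random_state=0)

# Detector that only sees the question: its score reflects question-side shortcuts.
q_only = LogisticRegression(max_iter=1000).fit(Xq_tr, y_tr)
auc_question_only = roc_auc_score(y_te, q_only.predict_proba(Xq_te)[:, 1])

# Detector that additionally uses the model-internal signal.
full = LogisticRegression(max_iter=1000).fit(Xfull_tr, y_tr)
auc_full = roc_auc_score(y_te, full.predict_proba(Xfull_te)[:, 1])

# One (hypothetical) way to read the gap: performance beyond the question-only
# detector is a rough proxy for genuine awareness of internal information.
print(f"question-only AUROC: {auc_question_only:.3f}")
print(f"full-detector AUROC: {auc_full:.3f}")
```

In this toy setup, a small gap between the two AUROCs would suggest that most of the detection score could be recovered from the question alone, which is the kind of question-side reliance the abstract refers to as benchmark hacking.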
Paper Type: Long
Research Area: Language Models
Research Area Keywords: safety and alignment, LLM/AI agents
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9221