Quantifying Genuine Awareness in Hallucination Prediction: Disentangling Question-Side Shortcuts

Published: 02 Mar 2026, Last Modified: 05 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: hallucination, LLM, self-awareness
TL;DR: We use a Shapley-based AQE to separate question-side shortcuts from model-side self-awareness in hallucination prediction, showing shortcut-heavy detectors fail under shift.
Abstract: Many works have proposed methodologies for language model (LM) hallucination detection and reported seemingly strong performance. However, we argue that the performance reported to date reflects not only a model’s genuine awareness of its internal information, but also awareness derived purely from question-side information (e.g., benchmark hacking). While benchmark hacking can boost hallucination detection scores on existing benchmarks, it does not generalize to out-of-domain settings or practical use. Nevertheless, disentangling how much of a model’s hallucination detection performance arises from question-side awareness is non-trivial. To address this, we propose the Approximate Question-side Effect (AQE), a methodology for measuring this effect without requiring human labor. Our analysis using AQE reveals that existing hallucination detection methods rely heavily on benchmark hacking.
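The abstract and TL;DR only state that AQE is Shapley-based; the exact formulation is not given here. The following is a minimal sketch, assuming AQE-style attribution amounts to computing Shapley values over two information sources for a hallucination detector: a question-only signal and a model-internal signal. The function names (`shapley_values`, `eval_detector`), the player names, and the toy AUROC numbers are all hypothetical placeholders, not the authors' implementation.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet

def shapley_values(players: list[str],
                   value: Callable[[FrozenSet[str]], float]) -> dict[str, float]:
    """Exact Shapley values over a small set of information sources.

    `value(S)` should return the hallucination detector's performance
    (e.g., AUROC) when only the sources in S are available to it.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                S = frozenset(subset)
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(S | {p}) - value(S))
    return phi

# Hypothetical usage: two "players" = question-side signal vs. model-internal
# signal. `eval_detector` is a stand-in that would retrain/evaluate a detector
# restricted to the given sources; here it returns toy additive AUROC values.
def eval_detector(sources: FrozenSet[str]) -> float:
    baseline = 0.5  # chance-level AUROC with no information
    gains = {"question": 0.18, "model_internal": 0.12}
    return baseline + sum(gains[s] for s in sources)

phi = shapley_values(["question", "model_internal"], eval_detector)
print(phi)  # the "question" share plays the role of a question-side-effect estimate
```

Under this reading, a detector whose performance is mostly attributed to the question-side player is exploiting benchmark shortcuts rather than genuine self-awareness, which is the distinction the abstract draws.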
Submission Number: 42