Keywords: Hallucination Detection; Manifold
Abstract: Large language models (LLMs) exhibit remarkable capabilities across various tasks but are prone to generating hallucinations, raising significant concerns about their reliability. Existing approaches for detecting hallucinations in unlabeled, real-world data often utilize information from the latent feature space. However, these studies have not thoroughly analyzed the sample distributions within this space and typically rely on linear separation methods. To better characterize these distributions, we introduce Hallucination Attention Regions (HARs) and True Attention Regions (TARs) to model the latent-space representations of hallucinated and truthful samples, respectively. Our empirical analysis reveals that HARs and TARs are nonlinearly separable. Based on this finding, we hypothesize that these high-dimensional distributions can be embedded into a low-dimensional manifold. We thus propose the HDME framework for automatically detecting hallucinations in unlabeled data. This framework comprises three steps: (1) projecting high-dimensional samples onto a low-dimensional manifold, (2) clustering the embedded data to generate pseudo-labels, and (3) training a hallucination detector with these pseudo-labels. Extensive experiments demonstrate that our method achieves superior performance in hallucination detection across diverse datasets.
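The three-step pipeline in the abstract can be sketched as follows. This is a minimal illustration with stand-in components (Isomap for the manifold projection, KMeans for clustering, logistic regression as the detector) and synthetic features; the paper's actual HDME embedding, clustering method, and detector architecture may differ.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins for high-dimensional latent features of LLM samples;
# shifting half of them mimics two regions (hallucinated vs. truthful).
X = rng.normal(size=(200, 64))
X[:100] += 2.0

# Step 1: project high-dimensional samples onto a low-dimensional manifold.
Z = Isomap(n_components=2, n_neighbors=10).fit_transform(X)

# Step 2: cluster the embedded data to generate pseudo-labels.
pseudo = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Step 3: train a hallucination detector with these pseudo-labels.
detector = LogisticRegression(max_iter=1000).fit(X, pseudo)
```

The detector can then score unseen samples via `detector.predict_proba`, without requiring any gold hallucination labels.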
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10424