Abstract: Hallucination detection is essential for ensuring the reliability of large language models. Internal representation–based methods have emerged as the prevailing direction for detecting hallucinations, yet these representations often fail to separate truthful from hallucinatory content clearly. To address this challenge, we study whether the sensitivity of internal representations to prompt-induced perturbations separates the two classes. We show theoretically that, with non-negligible probability, each sample admits a prompt under which truthful samples exhibit greater sensitivity to prompt-induced perturbations than hallucinatory samples. In an oracle setting on representative datasets, such separability is observed in nearly $99\%$ of samples, suggesting the potential of perturbation sensitivity as a discriminative indicator. Building on this insight, we propose Sample-Specific Prompting (SSP), which adaptively selects prompts to perturb the model’s internal states and uses the resulting sensitivity as a detection indicator. Extensive experiments across multiple benchmarks demonstrate that SSP consistently outperforms existing hallucination detection methods, validating its practical effectiveness.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have highlighted all modifications in green.
Assigned Action Editor: ~Kamil_Ciosek1
Submission Number: 7796