Next-Token Gradient Sensitivity Probing for LLM Hallucination Detection

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models; Hallucination Detection; Maximum Mean Discrepancy
Abstract: Large language models (LLMs) demonstrate remarkable text generation capabilities but often produce hallucinations, i.e., content that is factually inaccurate or unfaithful, posing significant deployment risks. Detecting such hallucinations remains challenging due to their often subtle and localized nature. In this paper, we propose a gradient-based paradigm for hallucination detection by probing next-token prediction sensitivity. Specifically, we introduce a statistic called Next-token Gradient Sensitivity (NGS), the first-order gradient of the maximum next-token log-probability with respect to the current token's layer embedding, which measures local prediction fragility. We prove that its norm bounds the change in prediction confidence under small perturbations. Building on NGS, we develop NGS-based Hallucination Detection (NGS-HD), a method that reframes detection as a token-level distribution comparison task. NGS-HD computes the Maximum Mean Discrepancy (MMD) between the NGS of each test token and reference NGS distributions drawn from truthful and hallucinated tokens, and aggregates these MMDs into a global truthfulness score for detection. We further derive finite-sample separation bounds for this score, providing theoretical guarantees for its reliability. Extensive experiments demonstrate that NGS-HD outperforms baseline methods, offering a reliable and interpretable solution for detecting LLM hallucinations.
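The sketch below illustrates, under stated assumptions, the two ingredients the abstract describes: per-token NGS vectors (the gradient of each position's maximum next-token log-probability with respect to a chosen layer's hidden state) and a Gaussian-kernel MMD comparison against truthful and hallucinated reference sets. The backbone (`gpt2`), layer index `ell`, bandwidth `sigma`, and the aggregation rule are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of NGS and an MMD-based truthfulness score (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in LLM; the paper's backbone is not specified here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def ngs_per_token(text: str, ell: int = 6) -> torch.Tensor:
    """One NGS vector per position: gradient of that position's maximum
    next-token log-probability w.r.t. its layer-`ell` hidden state."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    h = out.hidden_states[ell]                    # (1, T, d), still attached to the graph
    logp = torch.log_softmax(out.logits, dim=-1)  # (1, T, V)
    max_logp = logp.max(dim=-1).values[0]         # (T,) max next-token log-prob per position
    grads = []
    for t in range(ids.shape[1]):
        # one backward pass per token; keep only the gradient at position t itself
        g = torch.autograd.grad(max_logp[t], h, retain_graph=True)[0][0, t]
        grads.append(g)
    return torch.stack(grads).detach()            # (T, d) NGS vectors

def gaussian_mmd2(X: torch.Tensor, Y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between sample sets X (n, d) and Y (m, d)
    under a Gaussian kernel with bandwidth `sigma`."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def truthfulness_score(test_ngs, ref_truthful, ref_hallucinated, sigma: float = 1.0):
    """Illustrative aggregation: a token looks truthful when its NGS is closer
    (in MMD) to the truthful reference set than to the hallucinated one."""
    gaps = [gaussian_mmd2(g[None], ref_hallucinated, sigma)
            - gaussian_mmd2(g[None], ref_truthful, sigma)
            for g in test_ngs]
    return torch.stack(gaps).mean()
```

In this reading, reference NGS sets would be collected from annotated truthful and hallucinated generations, and the aggregated score would be thresholded to flag hallucinated outputs; the per-position backward loop is the simplest correct way to obtain position-wise gradients, at the cost of T backward passes.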
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11735