Keywords: Large Language Models, Hallucination Detection
Abstract: The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: *data-driven hallucinations* and *reasoning-driven hallucinations*. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the *Hallucination Risk Bound*, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This decomposition provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on it, we propose **HalluGuard**, a score based on the Neural Tangent Kernel (NTK) that leverages the geometry induced by the NTK and the representations it captures to jointly identify data-driven and reasoning-driven hallucinations. We validate **HalluGuard** through extensive experiments across 10 benchmarks, 9 LLM backbones, and 11 state-of-the-art detectors, consistently demonstrating its efficacy.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14994