Abstract: Hallucinations in LLMs pose a significant challenge to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but the embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful from hallucinated content.
To this end, we propose the **T**ruthfulness **S**eparator **V**ector (**TSV**), a lightweight and flexible steering vector that reshapes the LLM’s representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters.
Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters.
It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based pseudo-labeling algorithm combined with confidence-based filtering.
Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
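To make the inference-time intervention concrete, the following is a minimal sketch of how a steering vector of this kind could be injected into one decoder layer via a PyTorch forward hook, leaving all model weights unchanged. The base model, injection layer, and the (here untrained) vector are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the authors' implementation): add a learned steering vector
# to the hidden states of one decoder layer at inference time via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"            # hypothetical base LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

tsv = torch.zeros(model.config.hidden_size)        # would be learned on labeled exemplars
target_layer = 16                                  # hypothetical injection layer

def add_steering_vector(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + tsv.to(dtype=output[0].dtype, device=output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[target_layer].register_forward_hook(add_steering_vector)

prompt = "Who wrote 'On the Origin of Species'?"
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()                                    # restore the unmodified model
```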
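The pseudo-labeling stage can likewise be illustrated with a balanced Sinkhorn-Knopp assignment of unlabeled features to class prototypes, followed by confidence-based filtering. The cosine-similarity cost, assumed class priors, and confidence threshold below are assumptions for illustration rather than the paper's exact recipe.

```python
# Minimal sketch of optimal transport-based pseudo-labeling with confidence filtering.
import torch
import torch.nn.functional as F

def sinkhorn_pseudo_labels(feats, prototypes, priors, eps=0.05, n_iters=50, tau=0.9):
    """Assign unlabeled features to class prototypes with a balanced transport plan.

    feats:      (N, d) hidden-state features of unlabeled LLM generations
    prototypes: (C, d) class centroids (e.g., truthful vs. hallucinated exemplars)
    priors:     (C,)   assumed class proportions the assignment should respect
    tau:        minimum confidence for keeping a pseudo-label
    """
    sim = F.normalize(feats, dim=1) @ F.normalize(prototypes, dim=1).T   # (N, C)
    K = torch.exp(sim / eps)                                             # OT kernel

    # Sinkhorn-Knopp iterations: rows sum to 1/N each, columns sum to the priors.
    r = torch.full((feats.shape[0],), 1.0 / feats.shape[0])
    c = priors / priors.sum()
    b = torch.ones(prototypes.shape[0])
    for _ in range(n_iters):
        a = r / (K @ b)
        b = c / (K.T @ a)
    plan = a[:, None] * K * b[None, :]                                   # (N, C)

    probs = plan / plan.sum(dim=1, keepdim=True)                         # per-sample confidence
    conf, labels = probs.max(dim=1)
    keep = conf > tau                                                    # confidence-based filtering
    return labels, keep

# Toy usage: 100 unlabeled generations, 2 classes, assumed 60/40 prior split.
feats = torch.randn(100, 32)
prototypes = torch.randn(2, 32)
labels, keep = sinkhorn_pseudo_labels(feats, prototypes, torch.tensor([0.6, 0.4]))
print(f"kept {keep.sum().item()} of {feats.shape[0]} pseudo-labeled generations")
```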
Lay Summary: Large language models (LLMs) like ChatGPT have shown impressive capabilities, but they sometimes generate "hallucinations": statements that appear plausible but are factually inaccurate or unsupported. These hallucinations pose serious risks when LLMs are used in sensitive areas such as healthcare, law, and education. Existing methods for detecting hallucinations often rely on the model's internal representations, which are learned for linguistic fluency rather than factual correctness, resulting in unreliable detection.
To address this, we introduce the Truthfulness Separator Vector (TSV), a lightweight, plug-and-play method that helps distinguish truthful from hallucinated responses by adjusting the model's internal representations during inference so that truthful and hallucinated content become easier to separate, without retraining the model. TSV learns from a small set of labeled examples and then improves itself by automatically labeling additional unlabeled data using a technique based on optimal transport.
Our method achieves state-of-the-art hallucination detection with few labeled examples and works across different tasks and datasets. TSV is computationally efficient and easy to integrate, making it a practical step toward safer, more trustworthy AI systems in real-world applications.
Primary Area: Deep Learning->Large Language Models
Keywords: hallucination detection
Submission Number: 1196