Keywords: Alignment, LLMs, Uncertainty, Hallucinations, Factuality, Safety
Abstract: Large Language Models (LLMs) have emerged as powerful tools for knowledge-intensive tasks, yet their tendency to generate factually incorrect or misleading outputs—commonly referred to as hallucinations—poses a fundamental challenge to their reliability. While uncertainty estimation is critical for mitigating such errors, LLMs are not explicitly trained to represent or express uncertainty. In this work, we investigate whether and how uncertainty is implicitly encoded within pretrained models. Through a probing-based analysis, we demonstrate that LLMs internalize multiple distinct and dataset-specific uncertainty signals, which can be extracted as linear directions in their latent space. These signals are most pronounced in intermediate layers, exhibit limited cross-task generalization, and are substantially enhanced by instruction-tuning and [IDK]-token training. Building on these findings, we propose a novel framework that leverages a unified uncertainty direction to train LLMs to classify their own correctness. Our experiments show that this approach significantly improves factual precision and reduces hallucination rates under zero-shot evaluation. Together, these results provide new insights into the internal structure of uncertainty in LLMs and introduce a practical method for aligning models toward more trustworthy behavior.
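The abstract describes extracting uncertainty signals as linear directions via probing of intermediate-layer hidden states. As a rough illustration of that idea (not the paper's actual setup), the sketch below fits a logistic-regression probe on last-token hidden states to toy correctness labels; the model name, prompts, labels, and layer choice are all illustrative assumptions.

```python
# Minimal sketch of a linear probe for an "uncertainty direction" in hidden states.
# Assumptions (not from the paper): GPT-2 as a stand-in model, toy prompt/label pairs,
# and the last-token hidden state of an intermediate layer as the probed representation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder; the paper studies larger pretrained / instruction-tuned LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Toy probing data: prompts paired with a binary label (1 = model answered correctly).
# In practice these labels would come from scoring the model's answers on a QA dataset.
prompts = ["The capital of France is", "The capital of Australia is"]
labels = [1, 0]

layer = model.config.n_layer // 2  # probe an intermediate layer, where such signals are reported to be strongest

features = []
with torch.no_grad():
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        out = model(**ids)
        # hidden_states[0] is the embedding output, so index `layer` is the output of block `layer`
        h = out.hidden_states[layer][0, -1]  # last-token representation
        features.append(h.numpy())

# A logistic-regression probe: its weight vector is one candidate linear "uncertainty direction".
probe = LogisticRegression(max_iter=1000).fit(features, labels)
direction = probe.coef_[0]
print("probe direction shape:", direction.shape)
```

With real correctness labels, the learned `direction` could then be evaluated for cross-task transfer or used, as the abstract suggests, as a training signal for correctness classification; the helper setup here is purely hypothetical.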
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14549