Keywords: uncertainty estimation; large language models; linear probing; interpretability
Abstract: Uncertainty estimation in Large Language Models (LLMs) is challenging because token-level uncertainty includes uncertainty over lexical and syntactic variation, and thus fails to accurately capture uncertainty over the semantic meaning of the generation. To address this, Farquhar et al. recently introduced semantic uncertainty (SE), which quantifies uncertainty in semantic meaning by aggregating token-level probabilities over clusters of semantically equivalent generations. Kossen et al. further demonstrated that SE can be cheaply and reliably captured using linear probes on the model's hidden states. In this work, we build on these results and show that semantic uncertainty in LLMs can be predicted from only a very small set of neurons. We find these neurons by training linear probes with $L_1$ regularization. Our approach matches the performance of full-neuron probes in predicting SE. An intervention study further shows that these neurons causally affect the semantic uncertainty of model generations. Our findings reveal how hidden-state neurons encode semantic uncertainty, present a method to manipulate this uncertainty, and contribute insights to the field of interpretability research.
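As a rough illustration of the probing setup the abstract describes, the sketch below fits an $L_1$-regularized linear probe that maps hidden states to semantic entropy and reads off the neurons with nonzero weight. This is a minimal, hypothetical example: the data arrays, the `alpha` value, and the use of scikit-learn's `Lasso` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: an L1-regularized linear probe predicting semantic
# entropy (SE) from LLM hidden states. Data and hyperparameters are placeholders.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Assume hidden_states has shape (n_examples, hidden_dim) and semantic_entropy
# has shape (n_examples,), collected from a separate SE-labeling pipeline.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 4096))   # placeholder features
semantic_entropy = rng.uniform(size=1000)        # placeholder SE targets

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, semantic_entropy, test_size=0.2, random_state=0
)

# The L1 penalty drives most probe weights to exactly zero, so the nonzero
# coefficients pick out a small set of neurons associated with SE.
probe = Lasso(alpha=0.01)
probe.fit(X_train, y_train)

se_neurons = np.nonzero(probe.coef_)[0]
print(f"Test R^2: {probe.score(X_test, y_test):.3f}")
print(f"Neurons with nonzero weight: {len(se_neurons)} / {hidden_states.shape[1]}")
```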
Email Of Author Nominated As Reviewer: jiatong.han@u.nus.edu
Submission Number: 7