Keywords: Uncertainty, Probing, Contrast-Consistent Search, Concept Embeddings
TL;DR: We extend Contrast-Consistent Search to detect uncertainty in the factual beliefs of language models. We create a toy dataset of timestamped news factoids as a true/false/uncertain classification benchmark for LLMs with a known training cutoff date.
Abstract: Understanding a language model's beliefs about its truthfulness is crucial for building more trustworthy, factually accurate large language models. The recent method of Contrast-Consistent Search (CCS) measures this "latent belief" via a linear probe on intermediate activations of a language model, trained in an unsupervised manner to classify inputs as true or false. As an extension of CCS, we propose Uncertainty-detecting CCS (UCCS), which encapsulates finer-grained notions of truth, such as uncertainty or ambiguity. Concretely, UCCS teaches a probe, using only unlabeled data, to classify a model's latent belief on input text as true, false, or uncertain. We find that UCCS is an effective unsupervised-trained selective classifier, using its uncertainty class to filter out low-confidence truth predictions, leading to improved accuracy across a diverse set of models and tasks. To properly evaluate UCCS predictions of truth and uncertainty, we introduce a toy dataset, named Temporally Measured Events (TYMES), which comprises true or falsified facts, paired with timestamps, extracted from recent news articles from the past several years. TYMES can be combined with any language model's training cutoff date to systematically produce a subset of data beyond (literally, occurring after) the knowledge limitations of the model. TYMES serves as a valuable proof-of-concept for how we can benchmark uncertainty or time-sensitive world knowledge in language models, a setting which includes but extends beyond our UCCS evaluations.
Submission Number: 53
Loading