Abstract: As large language models (LLMs) are deployed across a wide range of application domains, understanding their capabilities through uncertainty quantification (UQ) is crucial for ensuring safe and reliable behavior. Reliable uncertainty estimates that accompany the text generated by an LLM can signal when a response is likely to be incorrect and thus serve as an effective fail-safe mechanism against hallucinations. In this paper, we explore the extent to which the probability of a frontier model answering a query correctly can be predicted from the embeddings of smaller, weaker, publicly available models using a simple probe. We show that this probability can be predicted effectively, and that the probes are easy to train, making oversight of large proprietary models more widely accessible. Leveraging embeddings from models as small as Llama3-8b, our predictor achieves 83.4% AUROC on TriviaQA and 64.3% on MMLU, and improves selective prediction accuracy by up to 17.9%. We then carefully analyze how different factors affect probe performance.
Across six benchmarks and fifteen weak predictors, we show that performance does not simply improve with predictor model size, and that the weak-to-strong signal is robust to label imbalance and embedding aggregation choices. These findings support the view that representational compatibility between weak-model embeddings and the strong model's behavior matters more than model size alone. Overall, our results advance the understanding of weak-to-strong generalization and provide a simple, scalable framework for building more trustworthy LLMs.
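To make the setup concrete, the following is a minimal sketch (not the authors' released code) of the kind of probe the abstract describes: a logistic-regression classifier trained on frozen weak-model embeddings to predict whether a strong model answers each query correctly, evaluated with AUROC and selective-prediction accuracy. The embeddings, labels, and dimensions below are synthetic placeholders standing in for real data.

```python
# Sketch of a weak-to-strong correctness probe (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for per-query embeddings from a small open model (e.g. Llama3-8b)
# and binary labels indicating whether the frontier model answered correctly.
n_queries, dim = 5000, 4096
X = rng.normal(size=(n_queries, dim)).astype(np.float32)
w_true = rng.normal(size=dim)
y = (X @ w_true + rng.normal(scale=5.0, size=n_queries) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The "simple probe": logistic regression on frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)

# AUROC of the predicted probability of correctness.
p_correct = probe.predict_proba(X_te)[:, 1]
print(f"AUROC: {roc_auc_score(y_te, p_correct):.3f}")

# Selective prediction: keep only the queries the probe is most confident
# the strong model will get right, and measure accuracy on that subset.
coverage = 0.5
keep = p_correct >= np.quantile(p_correct, 1 - coverage)
print(f"Selective accuracy at {coverage:.0%} coverage: {y_te[keep].mean():.3f}")
```

In practice, the synthetic `X` and `y` would be replaced by cached embeddings of the benchmark queries and correctness labels obtained by grading the strong model's answers.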
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jeff_Phillips1
Submission Number: 6770