A Systematic Assessment of Weak-to-Strong Confidence Prediction in Large Language Models

Published: 10 May 2026, Last Modified: 10 May 2026
Accepted by TMLR (CC BY 4.0)
Abstract: As large language models (LLMs) are deployed in increasingly diverse applications, understanding their capacity through uncertainty quantification (UQ) is crucial for ensuring safe and reliable behavior. Reliable uncertainty estimates that accompany the text generated by an LLM can signal when a response is likely to be incorrect and thus serve as an effective fail-safe mechanism against hallucinations. We study the extent to which a smaller and weaker open-access model, using only question embeddings and a lightweight probe, can predict the probability that a stronger black-box generator answers a query correctly. Across six benchmarks, two generators, and fifteen open-access predictors, we find that this simple approach provides useful confidence estimates: embeddings from models as small as Llama3-8b achieve 83.4% AUROC on TriviaQA and 64.3% on MMLU, and improve selective generator accuracy by up to 17.9%. Our analysis shows that performance is not determined by predictor size alone, but depends more strongly on representational compatibility between weak model embeddings and strong model correctness. The signal is robust to decoding configurations, label imbalance, and embedding aggregation choices, but is weaker on reasoning-heavy benchmarks such as SuperGPQA and transfers poorly across datasets. These findings suggest that weak-to-strong probes are best viewed as lightweight in-distribution confidence estimators: after generator-based labels are collected for training, they provide efficient deployment-time uncertainty estimates without repeated generator sampling. Overall, our results provide a systematic baseline for studying scalable oversight of black-box LLMs. Our code and data are available at: https://github.com/YukaiYang0803/w2s-confidence-prediction.
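The abstract's core setup (a lightweight probe on weak-model question embeddings, evaluated by AUROC against strong-generator correctness labels) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings, correctness labels, and dimensions below are synthetic stand-ins, and the probe is a plain logistic regression trained by gradient descent.

```python
# Hypothetical sketch of the weak-to-strong probe idea: a linear probe on
# (weak-model) question embeddings predicts whether the strong generator
# answers correctly. All data here is synthetic for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 32    # embedding dimension (a real weak model would give a larger d)
n = 2000  # number of questions

# Synthetic "question embeddings" and a hidden direction standing in for
# whatever makes the strong generator succeed or fail on a question.
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
labels = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(float)

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - labels) / n)
    b -= 0.5 * float((p - labels).mean())

# AUROC of the probe's confidence scores (rank-sum / Mann-Whitney formula).
scores = X @ w + b
ranks = np.empty(n)
ranks[scores.argsort()] = np.arange(1, n + 1)
n_pos = labels.sum()
n_neg = n - n_pos
auroc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(f"probe AUROC: {auroc:.3f}")
```

At deployment time, only the weak model's embedding and the trained probe are needed per query, which is what makes the approach cheap relative to repeated generator sampling.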
Submission Type: Regular submission (no more than 12 pages of main content)
Code: https://github.com/YukaiYang0803/w2s-confidence-prediction
Assigned Action Editor: ~Jeff_Phillips1
Submission Number: 6770