From small to large language models: How much confidence can we have?

ICLR 2026 Conference Submission 20676 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Uncertainty Quantification, Multi-label classification
TL;DR: We propose a new multi-label dataset and compare uncertainty quantification for models trained via discriminative or generative fine-tuning and for prompting-based approaches.
Abstract: In this paper, we provide novel insights into information-based, consistency-based, and self-verbalized uncertainty quantification (UQ) for multi-label text classification across a range of recent language models on a new, unsaturated benchmark of medical device adverse event reports with interdependent labels. We compare more than twenty encoder- and decoder-only language models across three paradigms: discriminative fine-tuning, generative fine-tuning, and few-shot in-context prompting (instruction-tuned and reasoning variants, local and API-accessible). UQ is performed using token-information measures, consistency under stochastic generation, and self-verbalized confidence, with utility assessed via selective prediction. We provide practical guidance on model selection, on when fine-tuning is preferable to prompting, and on which UQ signals are most effective for routing and human-in-the-loop triage. Our results reveal trade-offs across model types. Discriminatively fine-tuned decoders achieve the strongest head–tail accuracy while still offering solid UQ. In contrast, generative fine-tuning provides the most reliable UQ overall. Reasoning models improve performance on extreme-tail labels but yield weak UQ. Finally, self-verbalized confidence proves unreliable as an indicator of model certainty.
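To make the evaluation setup concrete, the sketch below (not the authors' code) illustrates two of the ingredients named in the abstract: a token-information uncertainty score computed from per-token log-probabilities, and a selective-prediction evaluation that retains only the most confident predictions. All function names, parameters, and the toy numbers are hypothetical placeholders.

```python
# Minimal sketch, assuming per-token log-probabilities of each generated
# label sequence are available (e.g. from a decoder's output scores).
import numpy as np

def token_information_uncertainty(token_logprobs):
    """Mean negative log-probability over generated tokens.

    Higher values indicate lower model confidence in the predicted labels.
    """
    return float(-np.mean(token_logprobs))

def selective_accuracy(uncertainties, correct, coverage=0.8):
    """Keep the `coverage` fraction of most confident predictions and
    report accuracy on the retained subset (risk-coverage style evaluation)."""
    order = np.argsort(uncertainties)                 # most confident first
    keep = order[: int(np.ceil(coverage * len(order)))]
    return float(np.mean(np.asarray(correct)[keep]))

# Toy example with hypothetical values, for illustration only:
uncert = [token_information_uncertainty(lp) for lp in
          ([-0.05, -0.10], [-1.2, -0.9, -2.1], [-0.3, -0.2])]
print(selective_accuracy(uncert, correct=[1, 0, 1], coverage=0.67))
```

Consistency-based UQ would instead sample several stochastic generations per input and score disagreement among them, while self-verbalized confidence asks the model to state its own confidence in the output; the paper compares all three signals under the same selective-prediction protocol.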
Primary Area: datasets and benchmarks
Submission Number: 20676