Abstract: Transformers have had a profound impact on the field of artificial intelligence, especially on large language models and their variants. Unfortunately, as was historically the case with neural networks, the black-box nature of transformer architectures presents significant challenges to interpretability and trustworthiness. These challenges are especially acute in high-stakes domains, such as healthcare, robotics, and finance, where incorrect predictions can have severe consequences, such as misdiagnosis or failed investments. For models to be genuinely useful and trustworthy in critical applications, they must provide more than predictions: they must supply users with a clear understanding of the reasoning that underpins their decisions. This paper presents an uncertainty quantification framework for transformer-based language models. The framework, called CONFIDE (CONformal prediction for FIne-tuned DEep language models), applies conformal prediction to the internal embeddings of encoder-only architectures such as BERT and RoBERTa, with configurable hyperparameters such as the distance metric and optional principal component analysis. CONFIDE uses either [CLS] token embeddings or flattened hidden states to construct class-conditional nonconformity scores, enabling statistically valid prediction sets with instance-level explanations. Empirically, CONFIDE improves test accuracy by up to $4.09\%$ on BERT-tiny and achieves greater correct efficiency (i.e., the expected size of the prediction set conditioned on it containing the true label) than prior methods, including NM2 and VanillaNN. We show that early and intermediate transformer layers often yield better-calibrated and more semantically meaningful representations for conformal prediction. In resource-constrained models and high-stakes tasks with ambiguous labels, CONFIDE offers robustness and interpretability where softmax-based uncertainty fails.
When the exchangeability assumption is violated, no method we tested, including CONFIDE, achieves nominal coverage, and minority or ambiguous classes often suffer undercoverage. We therefore position CONFIDE as a practical diagnostic framework that improves efficiency and robustness over prior conformal baselines.
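To make the abstract's core mechanism concrete, the sketch below illustrates class-conditional (Mondrian) conformal prediction on frozen embeddings. The specific nonconformity score used here (Euclidean distance to the class centroid) and all function and variable names are illustrative assumptions, not CONFIDE's actual implementation; the paper's method supports other distance metrics and optional PCA.

```python
import numpy as np

def class_conditional_conformal(cal_emb, cal_labels, test_emb, alpha=0.1):
    """Sketch: class-conditional conformal prediction over embeddings.

    Nonconformity score (illustrative choice): Euclidean distance to the
    class centroid computed from the calibration embeddings.
    """
    classes = np.unique(cal_labels)
    centroids = {c: cal_emb[cal_labels == c].mean(axis=0) for c in classes}

    # Per-class conformal quantile of calibration nonconformity scores,
    # using the standard finite-sample correction ceil((n+1)(1-alpha))/n.
    qhat = {}
    for c in classes:
        scores = np.linalg.norm(cal_emb[cal_labels == c] - centroids[c], axis=1)
        n = len(scores)
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        qhat[c] = np.quantile(scores, level)

    # Prediction set: every class whose score falls within its own quantile.
    sets = []
    for x in test_emb:
        s = {int(c) for c in classes
             if np.linalg.norm(x - centroids[c]) <= qhat[c]}
        sets.append(s)
    return sets
```

Under exchangeability, each class's prediction sets cover that class's true labels at roughly the 1 - alpha level; as the abstract notes, this guarantee degrades when exchangeability is violated.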
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Significant revisions were made in response to reviewer feedback, including clearer articulation of CONFIDE's limitations and a more thorough discussion of its challenges. Additionally, the manuscript has been streamlined to improve readability and ensure that key points are more effectively communicated. Specific changes are listed as replies to reviewer comments.
Assigned Action Editor: ~Sinead_Williamson1
Submission Number: 5157