Learning to Route LLMs with Confidence Tokens

Published: 01 May 2025, Last Modified: 23 Jul 2025. ICML 2025 poster. License: CC BY 4.0.
TL;DR: This work introduces Self-REF, a training strategy that teaches large language models to reliably express confidence in their answers, leading to improved performance in downstream tasks like routing and rejection learning.
Abstract: Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-Reflection with Error-based Feedback (Self-REF), a lightweight training strategy that teaches LLMs to reliably express confidence in whether their answers are correct. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens yield significant improvements on downstream routing and rejection learning tasks.
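To make the routing idea concrete, here is a minimal sketch (not the authors' implementation) of how a confidence score might be read off a pair of hypothetical confidence tokens, here called `<CN>` (confident) and `<UN>` (unconfident), and used to decide whether to keep the local model's answer or route the query to a stronger expert. The token names, the softmax over their two logits, and the routing threshold are all illustrative assumptions.

```python
# Hedged sketch of confidence-token routing. Assumptions (not from the paper):
# the model emits logits for two special tokens <CN>/<UN> after its answer,
# and a fixed threshold decides when to route to a stronger model.
import math


def confidence_score(logit_cn: float, logit_un: float) -> float:
    """Softmax probability of the 'confident' token <CN> against <UN>."""
    m = max(logit_cn, logit_un)  # subtract max for numerical stability
    e_cn = math.exp(logit_cn - m)
    e_un = math.exp(logit_un - m)
    return e_cn / (e_cn + e_un)


def route(logit_cn: float, logit_un: float, threshold: float = 0.5) -> str:
    """Keep the local answer when confident enough; otherwise route upstream."""
    if confidence_score(logit_cn, logit_un) >= threshold:
        return "keep_local"
    return "route_to_expert"
```

In a rejection-learning setting, the same score could instead trigger a safe default (e.g. abstaining) rather than a call to a larger model; only the action taken below the threshold changes.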
Lay Summary: Large language models (LLMs) often give answers without indicating how confident they are. This can be risky in situations where wrong answers have serious consequences. We introduce a method called Self-REF that helps LLMs signal when their answers might be unreliable by assigning a confidence score. We demonstrate that this confidence score is valuable for deciding when to route a query to another, more powerful LLM, or alternatively to reject the query. Across four datasets and two base LLMs, Self-REF outperforms existing approaches on both LLM routing and LLM rejection learning tasks.
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Large Language Model, LLM Routing, Model Confidence
Submission Number: 2671