Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Published: 06 Nov 2025, Last Modified: 09 Dec 2025 · AIR-FM Poster · CC BY 4.0
Keywords: Large Language Models, Reliability, Calibration, Robustness, Uncertainty Quantification, Composite Reliability Score, Safe AI, Model Evaluation, Trustworthy AI, Error Detection, Post-hoc Calibration, Benchmarking
TL;DR: We propose the Composite Reliability Score (CRS), a unified metric that integrates calibration, robustness, and uncertainty to holistically evaluate and compare LLM reliability.
Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.
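The abstract does not specify how the three components are aggregated, so the sketch below is only one plausible reading, not the authors' method: calibration is taken as 1 − ECE, robustness as accuracy retention under perturbation, uncertainty quality as the AUROC of model confidence at separating correct from incorrect answers, and CRS as their weighted mean. The equal default weights and all function names are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted |accuracy - confidence| gap over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def confidence_auroc(confidences, correct):
    """AUROC of confidence as an error detector (pairwise rank form, no sklearn)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = confidences[correct], confidences[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5
    # Probability that a correct answer gets higher confidence than an incorrect one.
    gt = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(gt + 0.5 * ties)

def composite_reliability_score(clean_acc, perturbed_acc, confidences, correct,
                                weights=(1/3, 1/3, 1/3)):
    """Illustrative CRS: weighted mean of calibration, robustness, uncertainty terms.

    Assumed components (not the paper's specification):
      calibration  = 1 - ECE                      (higher is better)
      robustness   = perturbed_acc / clean_acc    (retention under input shift)
      uncertainty  = AUROC of confidence for correct-vs-incorrect separation
    """
    calibration = 1.0 - expected_calibration_error(confidences, correct)
    robustness = perturbed_acc / clean_acc if clean_acc > 0 else 0.0
    uncertainty = confidence_auroc(confidences, correct)
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, [calibration, robustness, uncertainty]))

if __name__ == "__main__":
    # Synthetic, roughly calibrated answers: correctness probability equals confidence.
    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=200)
    corr = rng.random(200) < conf
    print(composite_reliability_score(clean_acc=0.80, perturbed_acc=0.72,
                                      confidences=conf, correct=corr))
```

Under this reading, all three components live on a comparable 0-to-1 scale, so a model that is accurate but poorly calibrated is penalized in the same units as one that degrades under perturbation, which matches the trade-off the abstract describes.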
Submission Track: Workshop Paper Track
Submission Number: 2