Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: Large Language Models, Reliability, Calibration, Robustness, Uncertainty Quantification, Composite Reliability Score, Safe AI, Model Evaluation, Trustworthy AI, Error Detection, Post-hoc Calibration, Benchmarking
TL;DR: We propose the Composite Reliability Score (CRS), a unified metric that integrates calibration, robustness, and uncertainty to holistically evaluate and compare LLM reliability.
Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects.

We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.
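The abstract does not spell out how the three components are combined, so the following is a minimal sketch of what a CRS-style aggregate could look like: a weighted mean of three sub-scores in [0, 1], with calibration measured as 1 − ECE, robustness as accuracy retained under perturbation, and uncertainty quality as the AUROC of confidence as an error detector. The function names, choice of sub-metrics, and equal weights are illustrative assumptions, not the paper's actual definition.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard ECE: per-bin |accuracy - confidence| gap, weighted by bin mass."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)  # bins are (lo, hi]
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def composite_reliability_score(acc_clean, acc_perturbed, conf, correct,
                                weights=(1/3, 1/3, 1/3)):
    """Hypothetical CRS: weighted mean of three sub-scores in [0, 1].
    The paper's actual weighting and sub-metrics may differ."""
    # Calibration: higher means confidence tracks accuracy more closely.
    calibration = 1.0 - expected_calibration_error(conf, correct)
    # Robustness: fraction of clean-input accuracy retained under perturbation.
    robustness = acc_perturbed / acc_clean if acc_clean > 0 else 0.0
    # Uncertainty quality: how well confidence separates right from wrong
    # answers (requires at least one correct and one incorrect prediction).
    uncertainty = roc_auc_score(correct, conf)
    w = np.asarray(weights) / np.sum(weights)
    return float(w @ [calibration, robustness, uncertainty])
```

Under this reading, a model only scores well if all three components do: a highly accurate but overconfident model is penalized through the calibration term, and a well-calibrated but brittle model through the robustness term.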
Submission Track: Workshop Paper Track
Submission Number: 2