Keywords: Large Language Models (LLMs), Reliability, Hallucination Detection, Safety, Jailbreak Attacks, Prompt Injection, Statistical Guarantees, Robustness, Trustworthy AI
Abstract: Large Language Models (LLMs) exhibit strong capabilities but remain vulnerable to factual hallucinations, unsafe responses, and adversarial attacks, issues that hinder deployment in safety-critical applications. Existing benchmarks assess important but disjoint facets of risk and offer neither principled uncertainty quantification nor analysis of how defenses compose. We propose SAFE-LLM, a unified, auditable evaluation framework for the Reliability, Safety, and Security of LLMs. SAFE-LLM provides: (i) a fine-grained taxonomy of risk scenarios; (ii) standardized metrics (Hallucination Rate, Safety Compliance Index, Jailbreak Success Rate, Prompt Injection Compromise Rate) with finite-sample and sequential confidence guarantees; (iii) theoretical results on coverage, sequential error control, sample complexity, defense composition, and adaptive adversary bounds; and (iv) a defense-aware benchmarking protocol and reporting format. We show how SAFE-LLM fills concrete gaps in current practice, outline a path to real-world audits, and discuss the societal impact of adopting SAFE-LLM as a standard for reliable LLM deployment.
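The sketch below illustrates the kind of finite-sample guarantee the abstract refers to for a rate-style metric such as the Hallucination Rate; it is not the paper's method. It assumes a simple binomial model over independently evaluated prompts and uses a Clopper-Pearson exact interval; the function name and the example counts are purely illustrative.

```python
# Illustrative sketch only: assumes each evaluated prompt is an independent
# Bernoulli trial (hallucinated / not hallucinated) and reports an exact
# Clopper-Pearson confidence interval for the underlying rate.
from scipy.stats import beta

def hallucination_rate_ci(num_hallucinated: int, num_prompts: int, alpha: float = 0.05):
    """Point estimate and exact (1 - alpha) Clopper-Pearson interval for a rate."""
    k, n = num_hallucinated, num_prompts
    rate = k / n
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return rate, (lower, upper)

# Hypothetical usage: 37 hallucinated answers observed out of 500 prompts.
rate, (lo, hi) = hallucination_rate_ci(37, 500)
print(f"Hallucination Rate = {rate:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```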
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13424