Keywords: Large Language Models (LLMs), Reliability, Hallucination Detection, Safety, Jailbreak Attacks, Prompt Injection, Statistical Guarantees, Robustness, Trustworthy AI
Abstract: Large Language Models (LLMs) exhibit strong capabilities but remain vulnerable to factual hallucinations, unsafe responses, and adversarial attacks, issues that hinder deployment in safety-critical applications. Existing benchmarks assess important but disjoint facets of risk and offer neither principled uncertainty quantification nor analysis of how defenses compose. We propose SAFE-LLM, a unified, auditable evaluation framework for the Reliability, Safety, and Security of LLMs. SAFE-LLM provides: (i) a fine-grained taxonomy of risk scenarios; (ii) standardized metrics (Hallucination Rate, Safety Compliance Index, Jailbreak Success Rate, Prompt Injection Compromise Rate) with finite-sample and sequential confidence guarantees; (iii) theoretical results on coverage, sequential error control, sample complexity, defense composition, and adaptive adversary bounds; and (iv) a defense-aware benchmarking protocol and reporting format. We show how SAFE-LLM fills concrete gaps in current practice, outline a path to real-world audits, and discuss the societal impact of adopting SAFE-LLM as a standard for reliable LLM deployment.
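The sketch below illustrates the kind of finite-sample guarantee the abstract refers to for a rate-style metric such as the Hallucination Rate; it is not the paper's method. It assumes a simple binomial model over independently evaluated prompts and uses a Clopper-Pearson exact interval; the function name and the example counts are purely illustrative.

```python
# Illustrative sketch only: assumes each evaluated prompt is an independent
# Bernoulli trial (hallucinated / not hallucinated) and reports an exact
# Clopper-Pearson confidence interval for the underlying rate.
from scipy.stats import beta

def hallucination_rate_ci(num_hallucinated: int, num_prompts: int, alpha: float = 0.05):
    """Point estimate and exact (1 - alpha) Clopper-Pearson interval for a rate."""
    k, n = num_hallucinated, num_prompts
    rate = k / n
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return rate, (lower, upper)

# Hypothetical usage: 37 hallucinated answers observed out of 500 prompts.
rate, (lo, hi) = hallucination_rate_ci(37, 500)
print(f"Hallucination Rate = {rate:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```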
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13424