Keywords: Large Language Models (LLMs), Reliability, Hallucination Detection, Safety, Jailbreak Attacks, Prompt Injection, Statistical Guarantees, Robustness, Trustworthy AI
Abstract: Large language models (LLMs) exhibit strong capabilities but remain vulnerable to factual hallucinations, unsafe responses, and adversarial attacks, which hinder deployment in safety-critical applications. Current benchmarks assess important but disjoint facets of risk and rarely provide principled uncertainty quantification or analysis of how layered defenses interact. We propose SAFE-LLM, a cohesive and auditable evaluation framework for the reliability, safety, and security of LLMs. SAFE-LLM offers: (i) a fine-grained taxonomy of risk scenarios; (ii) standardized metrics—Hallucination Rate, Safety Compliance Index, Jailbreak Success Rate, and Prompt Injection Compromise Rate—with finite-sample and sequential confidence guarantees; (iii) theoretical results on coverage, sequential error control, sample complexity, defense composition, and adaptive adversary bounds; and (iv) a defense-aware benchmarking protocol and reporting format. We demonstrate how SAFE-LLM fills specific gaps in existing practice, outline a path toward real-world audits, and discuss the broader impact of adopting SAFE-LLM as a standard for reliable LLM deployment.
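The abstract does not specify how the finite-sample confidence guarantees are constructed. As one minimal, hedged sketch (not necessarily the paper's method), the snippet below computes an exact Clopper-Pearson upper confidence bound on a binary failure metric such as the Hallucination Rate; the function name and the example counts are hypothetical.

```python
# Minimal sketch (assumed construction, not the paper's exact method):
# a finite-sample upper confidence bound on a per-example failure metric
# such as the Hallucination Rate, via the Clopper-Pearson interval
# for a binomial proportion.
from scipy.stats import beta

def clopper_pearson_upper(failures: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper confidence bound on the true failure rate."""
    if failures >= n:
        return 1.0
    return beta.ppf(1 - alpha, failures + 1, n - failures)

# Hypothetical example: 12 hallucinations flagged in 400 audited responses.
print(clopper_pearson_upper(failures=12, n=400))  # ~0.05 upper bound at 95% confidence
```

A bound of this form holds for any fixed sample size; sequential guarantees of the kind the abstract mentions would instead require anytime-valid constructions (e.g., confidence sequences), which are not detailed here.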
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13424