Keywords: certification, LLM agents
Abstract: Large language model (LLM)-powered agents have demonstrated impressive generative capabilities but remain susceptible to hallucinations, factual inaccuracies, and policy violations. Quantifying and certifying these generation risks is fundamental to ensuring their trustworthy deployment. This paper proposes a unified statistical framework for \emph{certifying LLM agent generation risks} from finite samples. We formalize risk as a bounded functional of the model’s conditional distribution and develop three complementary certification paradigms: (1) \textbf{Concentration-based certification}, which leverages classical concentration inequalities to bound the population risk; (2) \textbf{Conformal generation risk certification}, which provides distribution-free, finite-sample guarantees using conformal prediction; and (3) \textbf{Online conformal certification}, which extends these guarantees to temporally dependent or adaptive settings. We establish theoretical coverage guarantees for each paradigm and empirically evaluate them on factuality, toxicity, and policy-violation benchmarks. Our results demonstrate that conformal and online certification achieve valid and adaptive risk coverage while maintaining computational efficiency, paving the way toward practical, provably safe LLM agent deployment.
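As an illustration of the concentration-based paradigm, the following is a minimal sketch that assumes Hoeffding's inequality as the classical inequality and i.i.d. calibration losses bounded in [0, 1]. The abstract does not say which inequalities the paper uses, so the function name `hoeffding_upper_bound`, the confidence parameter `delta`, and the calibration data are all hypothetical. With probability at least 1 - delta over the calibration sample, the population risk is at most the empirical risk plus sqrt(log(1/delta) / (2n)).

```python
import math


def hoeffding_upper_bound(losses, delta):
    """One-sided Hoeffding upper bound on the mean of i.i.d. losses in [0, 1].

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = len(losses)
    if n == 0:
        raise ValueError("need at least one calibration loss")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if any(loss < 0.0 or loss > 1.0 for loss in losses):
        raise ValueError("losses must lie in [0, 1]")

    empirical_risk = sum(losses) / n
    # Slack term from Hoeffding's inequality for [0, 1]-bounded variables.
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    # The risk is bounded by 1 by assumption, so clip the certificate there.
    return min(1.0, empirical_risk + slack)


# Made-up calibration set: 1 marks a generation flagged as risky, 0 otherwise.
calibration_losses = [0] * 180 + [1] * 20
bound = hoeffding_upper_bound(calibration_losses, delta=0.05)
print(f"certified risk upper bound: {bound:.4f}")
```

The bound tightens at rate 1/sqrt(n) and holds only under the i.i.d. assumption, which is the limitation the conformal and online paradigms are meant to relax.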
Submission Number: 19