Sound Probabilistic Safety Bounds for Large Language Models

ICLR 2026 Conference Submission 19627 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Large Language Models, Rare Event Estimation, LLM Safety
Abstract: We introduce a novel framework for computing rigorous bounds on the probability that a given prompt to a large language model (LLM) generates harmful outputs. We study the application of classical Clopper–Pearson confidence intervals to derive probably approximately correct (PAC) bounds for this problem and discuss their limitations. As our main contribution, we propose an algorithm that analyzes features in the latent space to prioritize the exploration of branches of the autoregressive generation procedure that are more likely to produce harmful outputs. This approach enables the efficient computation of formal guarantees even when the true probability of harmfulness is extremely small. Our experimental results demonstrate the effectiveness of the method by computing non-trivial lower bounds for state-of-the-art LLMs.
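For concreteness, the Clopper–Pearson construction referenced in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation; sample_output and is_harmful are hypothetical placeholders for the LLM sampler and the harm classifier, and the sampling budget is an assumed parameter.

    # Minimal sketch: exact Clopper–Pearson confidence interval on the
    # probability that a prompt elicits a harmful output, estimated from
    # n i.i.d. samples of the model. `sample_output` and `is_harmful`
    # are hypothetical placeholders, not part of the paper's code.
    from scipy.stats import beta

    def clopper_pearson(k: int, n: int, alpha: float = 0.05):
        """Exact two-sided (1 - alpha) interval for a binomial proportion."""
        lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
        upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
        return lower, upper

    def pac_harm_bound(sample_output, is_harmful, n: int = 10_000, alpha: float = 0.05):
        """PAC-style guarantee: with probability at least 1 - alpha over the
        sampling, the true harm probability lies in the returned interval."""
        k = sum(is_harmful(sample_output()) for _ in range(n))
        return clopper_pearson(k, n, alpha)

Note that when no harmful sample is observed (k = 0), the lower bound is zero and the upper bound is only on the order of 3/n, which illustrates the limitation the abstract points to: plain Monte Carlo sampling cannot certify non-trivial bounds when the true probability of harmfulness is extremely small, motivating the latent-space-guided exploration of generation branches proposed as the main contribution.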
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19627