Keywords: Large Language Models, Rare Event Estimation, LLM Safety
Abstract: We introduce a novel framework for computing rigorous bounds on the probability that a given prompt to a large language model (LLM) generates harmful outputs. We study the application of classical Clopper–Pearson confidence intervals to derive probably approximately correct (PAC) bounds for this problem and discuss their limitations. As our main contribution, we propose an algorithm that analyzes features in the latent space to prioritize the exploration of branches in the autoregressive generation procedure that are more likely to produce harmful outputs. This approach enables the efficient computation of formal guarantees even in scenarios where the true probability of harmfulness is extremely small. Our experimental results demonstrate the effectiveness of the method by computing non-trivial lower bounds for state-of-the-art LLMs.
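For intuition, here is a minimal sketch of the Clopper–Pearson baseline the abstract refers to (the function name and the use of SciPy are illustrative assumptions, not the paper's implementation): it computes the exact two-sided interval for a binomial proportion from k harmful outputs observed in n i.i.d. samples.

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for a
    binomial proportion, given k successes (harmful outputs) in n trials.
    Illustrative sketch; not the paper's implementation."""
    # Boundary cases where the Beta quantile is degenerate.
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Example: zero harmful outputs in 10,000 samples still leaves a
# non-trivial upper bound on the harmfulness probability.
print(clopper_pearson(0, 10_000))  # ~ (0.0, 3.7e-4)
```

With k = 0 the upper bound is 1 - (alpha/2)**(1/n), which shrinks only on the order of 1/n; this is the limitation the abstract alludes to, since naive sampling cannot certify harmfulness probabilities many orders of magnitude smaller than 1/n.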
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19627