Tight PAC-Bayes Generalisation Guarantees for Large Language Model Safety Monitoring

Published: 26 May 2026, Last Modified: 26 May 2026, Venue: ICML 2026 FoGen Workshop Poster, License: CC BY 4.0
Keywords: LLM Safety, Generalisation, PAC-Bayes, Compression
TL;DR: We show that generalisation in LLM safety oversight models can be understood through adaptor description length, with simple, compressible adaptations yielding tight, non-vacuous PAC-Bayes guarantees for both accuracy and uncertainty.
Abstract: How can we ensure that oversight models used to detect safety violations in AI systems generalise reliably, and which factors govern their generalisation? In this paper, we formalise PAC-Bayes certification for large language model-based safety oversight and obtain non-vacuous guarantees for safety oversight models, even when data for safety alignment is limited. Building on compression-based PAC-Bayes bounds, we show that highly compressed parameter-efficient fine-tuning (PEFT) adaptations yield extremely short adaptation description lengths, enabling informative and often tight guarantees that certify both classification risk and predictive uncertainty. We introduce a global-scale quantisation method (LoRA-GT) that reduces adaptor description length while preserving model performance, thereby tightening the bounds. Our results show that certifiability is strongly linked to adaptor compressibility, with shorter adaptor description lengths yielding tighter guarantees. Empirically, highly compressed adaptations exhibit minimal degradation in performance while enabling substantially stronger certification, suggesting that certifiable large language model oversight may naturally favour low-complexity safety adaptations. We further show that functional distortion tracks both test risk and bound tightness under compression, providing a practical criterion for selecting simple, certifiable safety adaptations. Together, these results show that compression-based PAC-Bayes analysis provides a practical framework for understanding and designing reliable safety oversight models.
Submission Number: 204
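
The link the abstract draws between description length and bound tightness can be illustrated with a minimal sketch. It uses the classical Occam (prefix-code) bound, which is an assumption made here for illustration and not necessarily the bound used in the paper: for a hypothesis encoded in L bits under a prefix-free code, with probability at least 1 - delta over n i.i.d. samples, the true risk is at most the empirical risk plus sqrt((L ln 2 + ln(1/delta)) / (2n)). The function name `occam_bound` and all numbers below are hypothetical.

```python
import math


def occam_bound(empirical_risk: float, description_bits: float,
                n: int, delta: float = 0.05) -> float:
    """Classical Occam bound on the true risk of a single hypothesis.

    Assumes a prior P(h) = 2^(-description_bits) induced by a prefix-free
    code, a loss bounded in [0, 1], and n i.i.d. samples. This is an
    illustrative stand-in, not the paper's exact bound.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # ln(1 / P(h)) + ln(1 / delta), with ln(1 / P(h)) = bits * ln 2
    complexity = description_bits * math.log(2) + math.log(1.0 / delta)
    # Risk never exceeds 1, so a bound above 1 is vacuous and is clipped.
    return min(1.0, empirical_risk + math.sqrt(complexity / (2 * n)))


# Hypothetical numbers: the same empirical risk and sample size,
# but two adaptors with very different description lengths.
compressed = occam_bound(empirical_risk=0.05, description_bits=2_000, n=10_000)
uncompressed = occam_bound(empirical_risk=0.05, description_bits=200_000, n=10_000)

print(f"compressed adaptor bound:   {compressed:.3f}")
print(f"uncompressed adaptor bound: {uncompressed:.3f}")
```

Under these assumed numbers the shorter encoding gives an informative bound, while the longer one is clipped to the vacuous value of 1, which matches the qualitative claim that shorter adaptor description lengths yield tighter guarantees.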