Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses

Published: 22 Jan 2025, Last Modified: 28 Feb 2025 · ICLR 2025 Poster · CC BY 4.0
Keywords: AI Safety, Information Theory
TL;DR: We provide a novel, information-theoretic threat model for AI safety.
Abstract: The vulnerability of frontier language models to misuse has prompted the development of safety measures, such as filters and alignment training, that seek to ensure safety through robustness to adversarially crafted prompts. We argue that robustness is fundamentally insufficient for ensuring safety because of inferential threats from dual-intent queries, risks that current defenses and evaluations fail to account for. To quantify these risks, we introduce a new safety evaluation framework based on $\textit{impermissible information leakage}$ of model outputs, and we demonstrate that our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our evaluation method is a novel information-theoretic threat model of $\textit{inferential adversaries}$, distinguished from $\textit{security adversaries}$ (such as jailbreaks) in that success requires inferring impermissible knowledge from victim outputs rather than forcing the victim to produce explicitly impermissible outputs. Through our information-theoretic framework, we show that ensuring safety against inferential adversaries requires defenses that bound impermissible information leakage, and that such defenses inevitably incur safety-utility trade-offs.
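The following is a minimal toy sketch (not taken from the paper or its supplementary material) of how impermissible information leakage might be quantified: as the adversary's reduction in uncertainty, in bits, about an impermissible fact after observing model responses. The 4-way candidate space and all probability values below are hypothetical; the point of the toy is that a refused direct query leaks nothing, while answers to individually benign sub-questions obtained by question decomposition can still sharply reduce the adversary's uncertainty.

```python
# Toy sketch of "impermissible information leakage" as uncertainty reduction.
# All distributions are hypothetical illustrations, not the paper's numbers.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Adversary's prior over 4 candidate values of the impermissible fact.
prior = np.array([0.25, 0.25, 0.25, 0.25])

# Posterior after aggregating answers to several individually benign
# (dual-intent) sub-questions: each answer down-weights some candidates.
posterior_after_subquestions = np.array([0.70, 0.20, 0.05, 0.05])

# Posterior after a direct harmful query that the censored model refuses:
# a refusal leaves the prior (essentially) unchanged.
posterior_after_refusal = prior.copy()

leak_decomposition = entropy(prior) - entropy(posterior_after_subquestions)
leak_refusal = entropy(prior) - entropy(posterior_after_refusal)

print(f"Leakage via question decomposition: {leak_decomposition:.2f} bits")
print(f"Leakage via refused direct query:   {leak_refusal:.2f} bits")
```

On these toy numbers the decomposition attack leaks roughly 0.74 bits while the refused direct query leaks 0 bits, illustrating why a defense that is robust to the direct query can still fail to bound cumulative leakage.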
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7301