In this work, we investigate the susceptibility of Audio Language Models (ALMs) to audio-based jailbreaks and introduce Best-of-N (BoN) Jailbreaking, a black-box jailbreaking algorithm for extracting harmful information from ALMs. To craft jailbreak inputs, our approach samples audio augmentations and applies them to malicious prompts, repeating this process until it finds a set of augmentations that elicits a harmful response from the target ALM. Empirically, we find that applying BoN with 7000 sampled augmentations achieves an attack success rate (ASR) of over 60% on all models tested, including the preview model of the released GPT-4o. Furthermore, we uncover power laws that accurately predict the ASR of BoN jailbreaking as a function of the number of samples, which allows us to forecast its effectiveness over an order of magnitude in the number of sampled augmentations. Finally, we show that BoN jailbreaking can be composed with other black-box attack algorithms to yield even more effective attacks: combining BoN with an optimized prefix attack achieves a 98% ASR on Gemini Pro and Flash. Overall, by exploiting stochastic sampling and ALMs' sensitivity to variations in a high-dimensional input space, we propose a scalable, composable, and highly effective black-box algorithm for attacking state-of-the-art ALMs.
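To make the sampling procedure concrete, below is a minimal sketch of the Best-of-N loop described above. The augmentation types and parameter ranges, the helper names (`sample_augmentation`, `apply_augmentation`, `query_alm`, `is_harmful`), and the power-law parameterization in `forecast_asr` are illustrative assumptions based on this abstract, not the paper's exact configuration.

```python
import numpy as np

def sample_augmentation(rng):
    """Sample a random audio-augmentation configuration (illustrative ranges)."""
    return {
        "speed": rng.uniform(0.7, 1.5),       # playback-speed factor
        "gain": rng.uniform(0.5, 2.0),        # volume scaling
        "noise_std": rng.uniform(0.0, 0.02),  # additive Gaussian noise level
    }

def apply_augmentation(waveform, aug, rng):
    """Apply a sampled augmentation to a mono waveform (float array in [-1, 1])."""
    # Speed change via naive index resampling; a real pipeline would also
    # include effects such as pitch shifting (e.g. via librosa).
    idx = np.arange(0, len(waveform), aug["speed"])
    out = np.interp(idx, np.arange(len(waveform)), waveform)
    out = out * aug["gain"]
    out = out + rng.normal(0.0, aug["noise_std"], size=out.shape)
    return np.clip(out, -1.0, 1.0)

def bon_jailbreak(prompt_waveform, query_alm, is_harmful, n_max=7000, seed=0):
    """Best-of-N: resample augmentations of a malicious audio prompt until the
    target ALM returns a response that the judge flags as harmful.

    query_alm:  callable mapping a waveform to the ALM's text response (black box).
    is_harmful: callable (e.g. a harmfulness classifier) over the response text.
    """
    rng = np.random.default_rng(seed)
    for n in range(1, n_max + 1):
        aug = sample_augmentation(rng)
        candidate = apply_augmentation(prompt_waveform, aug, rng)
        response = query_alm(candidate)          # one black-box query per sample
        if is_harmful(response):
            return {"n_samples": n, "augmentation": aug, "response": response}
    return None                                   # no jailbreak within the budget

def forecast_asr(n, a, b):
    """Illustrative power-law forecast of ASR versus the number of samples n.
    The abstract states only that ASR follows a power law in n; this particular
    parameterization (-log ASR ~ a * n**-b, with a, b fit on small-n data and
    extrapolated) is an assumption, not a form confirmed by the abstract."""
    return np.exp(-a * n ** (-b))
```

A judge model or rule-based classifier would typically play the role of `is_harmful`, and the loop's cost scales linearly with the number of black-box queries, which is why the power-law forecast of ASR versus sample count is useful for budgeting an attack.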