In this work, we investigate the susceptibility of Audio Language Models (ALMs) to audio-based jailbreaks and introduce Best-of-N (BoN) Jailbreaking, a black-box jailbreaking algorithm for extracting harmful information from ALMs. To craft jailbreak inputs, our approach samples audio augmentations and applies them to malicious prompts, repeating this process until it finds a set of augmentations that elicits a harmful response from the target ALM. Empirically, we find that applying BoN with 7000 sampled augmentations achieves an attack success rate (ASR) of over 60% on all models tested, including the preview model of the released GPT-4o. Furthermore, we uncover power laws that accurately predict the ASR of BoN jailbreaking as a function of the number of samples, which allows us to forecast its effectiveness over an order of magnitude in the number of sampled augmentations. Finally, we show that BoN jailbreaking can be composed with other black-box attack algorithms to yield even more effective attacks: combining BoN with an optimized prefix attack achieves a 98% ASR on Gemini Pro and Flash. Overall, by exploiting stochastic sampling and ALMs' sensitivity to variations in a high-dimensional input space, we propose a scalable, composable, and highly effective black-box algorithm for attacking state-of-the-art ALMs.
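To make the sampling procedure concrete, below is a minimal sketch of the Best-of-N loop described above. The augmentation types and parameter ranges, the helper names (`sample_augmentation`, `apply_augmentation`, `query_alm`, `is_harmful`), and the power-law parameterization in `forecast_asr` are illustrative assumptions based on this abstract, not the paper's exact configuration.

```python
import numpy as np

def sample_augmentation(rng):
    """Sample a random audio-augmentation configuration (illustrative ranges)."""
    return {
        "speed": rng.uniform(0.7, 1.5),       # playback-speed factor
        "gain": rng.uniform(0.5, 2.0),        # volume scaling
        "noise_std": rng.uniform(0.0, 0.02),  # additive Gaussian noise level
    }

def apply_augmentation(waveform, aug, rng):
    """Apply a sampled augmentation to a mono waveform (float array in [-1, 1])."""
    # Speed change via naive index resampling; a real pipeline would also
    # include effects such as pitch shifting (e.g. via librosa).
    idx = np.arange(0, len(waveform), aug["speed"])
    out = np.interp(idx, np.arange(len(waveform)), waveform)
    out = out * aug["gain"]
    out = out + rng.normal(0.0, aug["noise_std"], size=out.shape)
    return np.clip(out, -1.0, 1.0)

def bon_jailbreak(prompt_waveform, query_alm, is_harmful, n_max=7000, seed=0):
    """Best-of-N: resample augmentations of a malicious audio prompt until the
    target ALM returns a response that the judge flags as harmful.

    query_alm:  callable mapping a waveform to the ALM's text response (black box).
    is_harmful: callable (e.g. a harmfulness classifier) over the response text.
    """
    rng = np.random.default_rng(seed)
    for n in range(1, n_max + 1):
        aug = sample_augmentation(rng)
        candidate = apply_augmentation(prompt_waveform, aug, rng)
        response = query_alm(candidate)          # one black-box query per sample
        if is_harmful(response):
            return {"n_samples": n, "augmentation": aug, "response": response}
    return None                                   # no jailbreak within the budget

def forecast_asr(n, a, b):
    """Illustrative power-law forecast of ASR versus the number of samples n.
    The abstract states only that ASR follows a power law in n; this particular
    parameterization (-log ASR ~ a * n**-b, with a, b fit on small-n data and
    extrapolated) is an assumption, not a form confirmed by the abstract."""
    return np.exp(-a * n ** (-b))
```

A judge model or rule-based classifier would typically play the role of `is_harmful`, and the loop's cost scales linearly with the number of black-box queries, which is why the power-law forecast of ASR versus sample count is useful for budgeting an attack.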