Abstract: Speech Language Models (SLMs) accept audio as input, allowing users to interact via spoken instructions and potentially enabling more nuanced acoustic understanding. However, this added functionality introduces new security risks: adversaries can bypass safety mechanisms by injecting adversarial noise into the audio input. In this work, we analyze the vulnerability of open-source SLMs to such attacks and evaluate a range of defense mechanisms. We find that these models are susceptible to jailbreak attacks, with attack success rates reaching 100\% in some instances. We propose post hoc defense techniques, including activation patching, that improve robustness by up to 99\% with negligible impact on utility. Additionally, we evaluate defenses applied at both the audio encoder and the language model components, weighing their impact on adversarial resistance against usability.
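The abstract names activation patching as a defense but does not detail the procedure. The following is a minimal PyTorch-style sketch of activation patching in general, not the paper's implementation: a layer's activation from a clean forward pass is captured and substituted into a forward pass on the adversarial input. All identifiers (`model.layers`, `clean_inputs`, `adv_inputs`) are hypothetical assumptions about the model's structure.

```python
# Hedged sketch of generic activation patching, assuming a PyTorch
# transformer whose decoder blocks are exposed as model.layers and whose
# forward() accepts a tensor of input IDs or audio features. This is an
# illustration of the technique, not the paper's actual defense code.
import torch


def capture_activation(model, layer_idx, clean_inputs):
    """Run the model on clean inputs and record one layer's output."""
    cache = {}

    def hook(module, args, output):
        # Some layers return tuples (hidden_states, ...); keep the tensor.
        cache["act"] = output[0] if isinstance(output, tuple) else output

    handle = model.layers[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        model(clean_inputs)
    handle.remove()
    return cache["act"]


def patched_forward(model, layer_idx, adv_inputs, clean_act):
    """Re-run on adversarial inputs, overwriting the chosen layer's
    activation with the one captured from the clean run. Assumes the
    clean and adversarial activations share the same shape."""

    def hook(module, args, output):
        # Returning a value from a forward hook replaces the output.
        if isinstance(output, tuple):
            return (clean_act,) + output[1:]
        return clean_act

    handle = model.layers[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        out = model(adv_inputs)
    handle.remove()
    return out
```

In a defense setting, the patched activation would come from a reference run without the injected noise (or from an aggregate of benign activations), so the downstream language model processes a sanitized representation; the layer index is a design choice traded off against utility.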
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Jailbreaking, LLM privacy, Speech, Mechanistic Interpretability
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7449