The Sword of DamocleSpeech: Demystifying Jailbreaking Attack in Discrete Token-based Speech Large Language Models

ICLR 2026 Conference Submission17232 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Jailbreak, Speech Large Language Models, Omni Models, Safety
Abstract: Speech Large Language Models (SpeechLLMs) and Omni models have recently achieved remarkable progress in human-like dialogue, prosody, and expressive emotion. However, due to fragmented architectures, diverse training data, and inconsistent alignment strategies, research on jailbreak attacks and safety alignment in SpeechLLMs and Omni models remains limited and largely unsystematic. In this work, we observe that mainstream SpeechLLMs typically model text tokens and discretized audio tokens jointly, often with distinct generation strategies for each. These discrete token-based SpeechLLMs remain highly vulnerable to prefilling attacks, in which inserting a single token is sufficient to trigger simultaneous jailbreaks in both the speech and text modalities. Through a systematic evaluation of eight mainstream open-source SpeechLLMs and Omni models across three common benchmarks, we find that heterogeneous token cooperation during cross-modal generation leads to a reproducible jailbreak trajectory. To understand the underlying mechanism, we conduct a manifold analysis and reveal modal misalignment between discretized audio representations and textual embeddings. In addition, we propose a new set of metrics for evaluating jailbreak effectiveness, offering a multi-perspective assessment of safety bypasses. Our findings highlight fundamental weaknesses in current joint modeling strategies and provide a foundation for designing robust defenses in multimodal generative models. All code and logs are available in our anonymous GitHub repository.
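The prefilling vulnerability described in the abstract can be illustrated with a minimal toy sketch (every name below, including the token strings and the toy refusal policy, is a hypothetical illustration, not the paper's implementation). A discrete token-based SpeechLLM is abstracted as a next-token function over interleaved text and audio tokens; prefilling seeds the response with a single attacker-chosen affirmative token before decoding begins, which flips the continuation from a refusal into compliant output in both modalities at once.

```python
# Toy sketch of a prefilling attack on a discrete token-based SpeechLLM.
# All token names and the refusal policy are hypothetical illustrations.

REFUSAL = ["<txt:I>", "<txt:cannot>", "<txt:help>"]
COMPLIANCE = ["<txt:Here>", "<txt:is>", "<aud:801>", "<aud:334>"]  # text + audio tokens

def toy_next_token(prompt, generated):
    """Toy policy: refuse harmful prompts unless the response already
    begins with the affirmative token '<txt:Sure>' (the prefilled token)."""
    prefilled = bool(generated) and generated[0] == "<txt:Sure>"
    path = COMPLIANCE if ("harmful" not in prompt or prefilled) else REFUSAL
    idx = len(generated) - (1 if prefilled else 0)  # skip the prefix token
    return path[idx] if idx < len(path) else "<eos>"

def generate(prompt, prefill=()):
    tokens = list(prefill)  # attacker-controlled response prefix, if any
    while (t := toy_next_token(prompt, tokens)) != "<eos>":
        tokens.append(t)
    return tokens

print(generate("harmful request"))                          # refusal tokens only
print(generate("harmful request", prefill=["<txt:Sure>"]))  # compliant text + audio tokens
```

The point of the sketch is the asymmetry: the attacker changes a single token of the response prefix, yet because text and audio tokens are decoded jointly from the same context, both output modalities follow the compliant trajectory.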
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17232