Keywords: AudioLM, ALM, Benchmark, Dataset, Jailbreak Attacks
Abstract: Large Audio Language Models (LALMs) integrate the audio modality directly into the model, rather than converting speech into text and feeding the text to Large Language Models (LLMs).
While jailbreak attacks on LLMs have been extensively studied, the security of LALMs with audio modalities remains largely unexplored.
Currently, there is no adversarial audio dataset or unified framework designed specifically to evaluate and compare both attacks and LALMs.
In this paper, we present JALMBench, a comprehensive benchmark to assess the safety of LALMs against jailbreak attacks.
JALMBench includes a dataset containing 11,316 text samples and 245,355 audio samples (>1,000 hours).
It supports 12 mainstream LALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods.
Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and architecture.
Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.
We find that LALM safety is strongly influenced by modality and architectural choices,
demonstrating that text-based safety alignment can partially transfer to audio inputs and that an interleaved audio-text strategy enables more robust cross-modal generalization of safety.
Moreover, current general-purpose moderation against jailbreaks only slightly improves security; we therefore call on the community to explore more defense methods for LALMs.
Our work is the first to systematically uncover these design principles, providing a roadmap for building resilient multimodal language models.
Primary Area: datasets and benchmarks
Submission Number: 1450