Keywords: Adversarial attack
TL;DR: This paper proposes Shadows, a query-free jailbreak pipeline against Text-to-Image (T2I) models that leverages bimodal guidance from both the textual and visual modalities.
Abstract: Jailbreaks against Text-to-Image (T2I) models can be used to evaluate models' vulnerability to generating Not Safe For Work (NSFW) visual content. LLM-powered query-free jailbreaks are particularly promising because their optimization does not require expensive and easily detectable query interactions with the target model. However, we identify two problems with existing LLM-powered query-free jailbreaks: (1) in the textual modality, they limit safety criteria to individual words while neglecting contextual information, and (2) they overlook supervision from the visual modality, even though the ultimate jailbreak goal is to generate accurate NSFW visual content. To address these problems, we propose Shadows, a new query-free jailbreak pipeline with bimodal (textual and visual) guidance. Specifically, the textual guidance comes from contextual information via topic assistance and sentence expansion, and the visual guidance comes from additional prompt-image perceptual consistency measured with surrogate T2I and CLIP models. Large-scale experiments on 16 open-source T2I models (8 normal and 8 unlearned) with defensive text checkers and 4 commercial T2I APIs with built-in defenses demonstrate the effectiveness of Shadows. For example, on the unlearned model SafeGen, compared to the previous best query-free approach, Shadows achieves up to 2× the success rate in bypassing the semantic-based text checker and up to 4× the success rate in ultimately generating NSFW images.
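For readers unfamiliar with the visual-guidance signal mentioned above, the sketch below illustrates the generic idea of scoring prompt-image perceptual consistency with an off-the-shelf CLIP model. This is a minimal illustration of standard CLIP similarity scoring, not the authors' Shadows pipeline; the checkpoint name (openai/clip-vit-base-patch32) and the image path are assumptions for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; any CLIP variant would work similarly.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_consistency(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of a prompt and an image.

    A higher score indicates the image is perceptually more consistent
    with the prompt; such a score could serve as a generic visual-guidance
    signal when a surrogate T2I model renders candidate prompts.
    """
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # text_embeds / image_embeds are the projected CLIP embeddings; normalize
    # them so the dot product is a cosine similarity in [-1, 1].
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb @ img_emb.T).item()

# Hypothetical usage: score an image produced by a surrogate T2I model.
# score = clip_consistency("a photo of a cat", Image.open("generated.png"))
```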
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15138