GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

Published: 05 Mar 2025 · Last Modified: 14 Apr 2025 · Building Trust Workshop · CC BY 4.0
Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: LLM Safety, Jailbreak Attacks, Adversarial Vulnerability
TL;DR: GASP is a novel black-box attack framework that efficiently explores the embedding space to generate human-readable adversarial suffixes, significantly improving jailbreak success rates while maintaining prompt coherence.
Abstract: LLMs have demonstrated remarkable capabilities but remain highly susceptible to adversarial prompts despite extensive safety-alignment efforts, raising serious security concerns for their real-world adoption. Existing jailbreak attacks rely on manual heuristics or computationally expensive optimization techniques, both of which struggle with generalization and efficiency. In this paper, we introduce GASP, a novel black-box attack framework that leverages latent Bayesian optimization to generate human-readable adversarial suffixes. Unlike prior methods, GASP efficiently explores continuous embedding spaces, optimizing for strong adversarial suffixes while preserving prompt coherence. We evaluate our method across multiple LLMs, showing its ability to produce natural and effective jailbreak prompts. Compared with alternatives, GASP significantly improves attack success rates and reduces computation costs, offering a scalable approach for red-teaming LLMs.
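
To make the core idea concrete, below is a minimal, self-contained sketch of latent Bayesian optimization for suffix search. It is an illustration, not the paper's implementation: `decode_suffix`, the toy `VOCAB`, the synthetic `objective`, and all parameter values are hypothetical stand-ins for GASP's actual components (including its real black-box query to a target LLM).

```python
# Minimal sketch: latent Bayesian optimization for adversarial suffix search.
# NOT the authors' implementation; decode_suffix, VOCAB, and the synthetic
# objective below are hypothetical stand-ins for GASP's components.
import numpy as np
from skopt import gp_minimize          # Gaussian-process surrogate optimizer
from skopt.space import Real

LATENT_DIM = 4                         # assumed latent-space dimensionality
VOCAB = ["please", "hypothetically", "roleplay", "expert",
         "story", "format", "safely", "imagine"]   # toy suffix vocabulary

def decode_suffix(z: np.ndarray) -> str:
    """Toy decoder: map each latent coordinate to a vocabulary word,
    yielding a human-readable suffix (a real decoder would be learned)."""
    idx = ((np.tanh(z) + 1) / 2) * (len(VOCAB) - 1)
    return " ".join(VOCAB[int(round(i))] for i in idx)

_TARGET = np.array([0.7, -0.4, 1.1, 0.2])  # pretend optimum for the demo

def objective(z) -> float:
    """Stand-in for the black-box query: a real attack would send
    base_prompt + decode_suffix(z) to the target LLM and return a refusal
    score (lower = jailbreak more likely). A smooth synthetic function
    keeps this sketch runnable without a model."""
    z = np.asarray(z)
    return float(np.sum((z - _TARGET) ** 2))

# The GP surrogate models the score as a function of z; each iteration
# maximizes an acquisition function to pick the next latent point to query,
# so far fewer black-box calls are needed than with exhaustive search.
result = gp_minimize(
    objective,
    dimensions=[Real(-3.0, 3.0) for _ in range(LATENT_DIM)],
    n_calls=40,
    random_state=0,
)

print("best latent point:", np.round(result.x, 3))
print("decoded suffix   :", decode_suffix(np.asarray(result.x)))
```

The point mirrored from the abstract: because the search runs over a continuous latent space with a sample-efficient surrogate, each candidate costs only one black-box query, and decoding the optimized latent point yields a coherent suffix rather than unreadable token sequences.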
Submission Number: 87