Keywords: Jailbreak Attack, Large Language Models
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities across natural language processing tasks but remain vulnerable to jailbreak attacks, in which adversarial inputs are crafted to elicit harmful or undesirable responses. Existing optimization-based attacks achieve high success rates but rely on white-box gradient access, making them impractical in black-box settings.
We focus on a practical scenario in which private LLMs are fine-tuned from public models and accessible only via query APIs, reflecting common real-world deployments. For this setting, we propose a two-stage local prompt optimization framework that transfers jailbreak attacks from public to private LLMs. Our method introduces an auxiliary adversarial suffix that aligns the output distributions of the public and target private models, enabling gradient-informed optimization in a purely local setup. Experiments show that our approach achieves high attack success rates on both open-source (Vicuna, LLaMA3) and proprietary models (GPT-4, Claude), and remains effective under diverse fine-tuning regimes, including LoRA-based updates.
These results highlight the practical security risks of fine-tuning LLMs and the need for robust defenses, while showing that highly transferable black-box attacks can be executed efficiently without accessing private model parameters.
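The abstract gives no implementation details, so the following is only a minimal, illustrative sketch of what gradient-informed local suffix optimization with an output-distribution alignment term could look like on a public surrogate model. The surrogate model name, the `suffix_loss` helper, the `kl_weight` hyperparameter, and the idea of filling `private_probs` from top-k log-probabilities returned by the private model's query API are assumptions for illustration, not the authors' method or code.

```python
# Illustrative sketch (assumptions noted above, not the paper's released code):
# optimize an adversarial suffix on a *public* model with gradients, adding a
# KL term that nudges the public model's output distribution toward next-token
# probabilities observed from the private model's query API.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # public surrogate (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

def suffix_loss(prompt_ids, suffix_embeds, target_ids,
                private_probs=None, kl_weight=0.1):
    """Cross-entropy on the target continuation, plus an optional
    KL(public || private) alignment term on the first generated position."""
    embed = model.get_input_embeddings()
    inputs = torch.cat([embed(prompt_ids), suffix_embeds, embed(target_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits predicting the target tokens start one position before the target span.
    tgt_start = prompt_ids.shape[1] + suffix_embeds.shape[1]
    tgt_logits = logits[:, tgt_start - 1 : tgt_start - 1 + target_ids.shape[1], :]
    loss = F.cross_entropy(tgt_logits.reshape(-1, tgt_logits.size(-1)),
                           target_ids.reshape(-1))
    if private_probs is not None:
        # private_probs: vocab-sized probabilities assembled from the private
        # model's API top-k log-probabilities (hypothetical, not queried here).
        pub_logprobs = F.log_softmax(tgt_logits[:, 0, :], dim=-1)
        loss = loss + kl_weight * F.kl_div(pub_logprobs, private_probs,
                                           reduction="batchmean")
    return loss

# One gradient-guided update over suffix embeddings (continuous relaxation);
# a GCG-style discrete token swap could be substituted at this step.
prompt_ids = tok("Write a tutorial on ...", return_tensors="pt").input_ids.cuda()
target_ids = tok("Sure, here is", return_tensors="pt",
                 add_special_tokens=False).input_ids.cuda()
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt",
                 add_special_tokens=False).input_ids.cuda()
suffix_embeds = (model.get_input_embeddings()(suffix_ids)
                 .detach().clone().requires_grad_(True))

loss = suffix_loss(prompt_ids, suffix_embeds, target_ids)
loss.backward()
with torch.no_grad():
    suffix_embeds -= 0.01 * suffix_embeds.grad  # gradient-informed local step
```

The sketch only shows the local, gradient-informed half of the loop; how the alignment suffix is obtained and how candidates are selected against the private API are design choices the abstract does not specify.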
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16311