Transferring Jailbreak Attacks from Public to Private LLMs via Local Prompt Optimization

ICLR 2026 Conference Submission 16311 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Jailbreak Attack, Large Language Models
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities across natural language processing tasks but remain vulnerable to jailbreak attacks, where adversarial inputs are crafted to elicit harmful or undesirable responses. Existing optimization-based attacks often achieve high success rates but are impractical in black-box settings. We focus on a practical scenario in which private LLMs are fine-tuned from public models and accessible only via query APIs, reflecting common real-world deployments. To address this, we propose a two-stage local prompt optimization framework that transfers jailbreak attacks from public to private LLMs. Our method introduces an auxiliary adversarial suffix to align output distributions between the public and target private models, enabling gradient-informed optimization in a purely local setup. Experiments show that our approach achieves high attack success rates on both open-source (Vicuna, LLaMA3) and proprietary models (GPT-4, Claude), and remains effective under diverse fine-tuning regimes, including LoRA-based updates. These results highlight the practical security risks of fine-tuning LLMs and the need for robust defenses, while showing that highly transferable black-box attacks can be executed efficiently without accessing private model parameters.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16311