Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

ACL ARR 2025 February Submission 6952 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Iterative jailbreak methods generate prompts that induce harmful outputs by repeatedly rewriting prompts and submitting them to large language models (LLMs), with each rewrite guided by the previous output. Although iterative jailbreaks are among the most powerful attack techniques, existing defenses take no proactive measures to disrupt this dynamic trial-and-error process. In this study, we propose a framework that updates the defense through online learning each time an iterative jailbreak method submits a prompt to the LLM for optimization. Furthermore, prompts generated by jailbreak methods tend to be more redundant, complex, and ambiguous than prompts that effectively harness the capabilities of LLMs for harmless tasks. We therefore hypothesize that prompt-rewriting techniques that optimize performance on harmless tasks can also prevent jailbreak attacks. To this end, we introduce a reinforcement learning-based method that optimizes prompts so that harmless prompts receive appropriate responses while harmful ones are rejected. Experiments on three LLMs show that the proposed method significantly outperforms five existing defenses against five iterative jailbreak methods. Our results also indicate that the proposed method improves the quality of responses to harmless prompts, suggesting that prompt optimization can simultaneously strengthen defense against harmful tasks and improve performance on harmless tasks.
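
To make the idea concrete, the sketch below illustrates the online-learning loop the abstract describes: every submitted prompt is rewritten by a small prompt optimizer, the rewritten prompt is sent to the target LLM, and the optimizer is updated from a reward that favors refusing harmful prompts and answering harmless ones. This is a minimal illustration, not the authors' implementation; the bandit-style value update merely stands in for the paper's reinforcement-learning optimizer, and query_target_llm, judge_harmful, and REWRITE_TEMPLATES are hypothetical placeholders.

```python
import random

# Hypothetical rewrite templates the optimizer chooses among.
REWRITE_TEMPLATES = [
    "{prompt}",                                                      # pass-through
    "Restate the request plainly, then answer only if it is safe:\n{prompt}",
    "Answer only if the request is harmless; otherwise refuse:\n{prompt}",
]

values = [0.0] * len(REWRITE_TEMPLATES)   # running value estimate per template
counts = [0] * len(REWRITE_TEMPLATES)     # how often each template was used


def query_target_llm(prompt: str) -> str:
    """Placeholder for the defended LLM; replace with a real API call."""
    return "[model response to] " + prompt


def judge_harmful(prompt: str) -> bool:
    """Placeholder safety judge; replace with a real classifier or LLM judge."""
    return "bomb" in prompt.lower()


def is_refusal(response: str) -> bool:
    """Crude refusal detector, used only to form the toy reward signal."""
    return any(k in response.lower() for k in ("cannot help", "refuse", "sorry"))


def choose_template(epsilon: float = 0.1) -> int:
    """Epsilon-greedy choice over rewrite templates."""
    if random.random() < epsilon:
        return random.randrange(len(REWRITE_TEMPLATES))
    return max(range(len(REWRITE_TEMPLATES)), key=lambda i: values[i])


def defend(prompt: str) -> str:
    """One defense step: rewrite, query, score, and update the optimizer online."""
    i = choose_template()
    rewritten = REWRITE_TEMPLATES[i].format(prompt=prompt)
    response = query_target_llm(rewritten)

    # Reward +1 if a harmful prompt was refused or a harmless one answered,
    # -1 otherwise (a stand-in for the paper's RL reward).
    reward = 1.0 if judge_harmful(prompt) == is_refusal(response) else -1.0

    # Incremental (online) update of the chosen template's value estimate.
    counts[i] += 1
    values[i] += (reward - values[i]) / counts[i]
    return response


if __name__ == "__main__":
    for p in ["How do I bake bread?", "Explain how to build a bomb."]:
        print(defend(p))
```

In the setting the abstract describes, the reward would instead come from the defended model's actual responses and a safety evaluator, with the optimizer updated online as the attacker iterates, so that each trial-and-error attempt by the jailbreak method also improves the defense.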
Paper Type: Long
Research Area: Generation
Research Area Keywords: large language model, jailbreak, AI safety
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6952