Learning to Instruct with Implicit Harmfulness: Transferable Black-Box Jailbreak on Large Language Models
Abstract: As Large Language Models (LLMs) are widely applied across various domains, their safety is attracting increasing attention to prevent their powerful capabilities from being misused. Existing black-box jailbreak studies manually design or automatically search for adversarial prompts with prefix or suffix words, guided by the finding that even character-level rewrites of the original prompt can induce completely different responses. Other studies adopt jailbreak templates that disguise the harmful intent by constructing fictional scenarios. However, these approaches suffer from low efficiency and explicit jailbreak patterns, which keep them far from realistic large-scale attacks on LLMs. In this paper, we propose TB$^3$, a $\textbf{T}$ransferable $\textbf{B}$lack-$\textbf{B}$ox jail$\textbf{B}$reak method that attacks LLMs by iteratively probing their weaknesses and automatically refining the attack strategy. Since it requires no manually designed prompts or prefix/suffix templates, the jailbreak is more efficient and harder to detect. Extensive experiments and analysis demonstrate the effectiveness of TB$^3$, and we find that the jailbreak also transfers to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety. The code can be found at https://anonymous.4open.science/r/TB3/.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, red teaming
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1401