Learning to Instruct with Implicit Harmfulness: Transferable Black-Box Jailbreak on Large Language Models
Abstract: As Large Language Models (LLMs) are widely applied across various domains, their safety is attracting increasing attention to prevent their powerful capabilities from being misused. Existing black-box jailbreak studies manually design or automatically search for adversarial prompts with prefix or suffix words, guided by the finding that even character-level rewrites of the original prompt can induce completely different responses. Other studies adopt jailbreak templates that disguise the harmful intent by constructing fictional scenarios. However, these approaches suffer from low efficiency and explicit jailbreak patterns, which keep them far from realistic large-scale attacks on LLMs. In this paper, we propose TB$^3$, a $\textbf{T}$ransferable $\textbf{B}$lack-$\textbf{B}$ox jail$\textbf{B}$reak method that attacks LLMs by iteratively probing their weaknesses and automatically refining the attack strategy. Since it requires no manually designed prompts or prefix/suffix templates, the jailbreak is more efficient and harder to detect. Extensive experiments and analysis demonstrate the effectiveness of TB$^3$, and we find that the jailbreak also transfers to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety. The code can be found at https://anonymous.4open.science/r/TB3/.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, red teaming
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1401