Abstract: As the capabilities of large language models (LLMs) improve, their safety has garnered increasing attention. In this paper, we introduce Iterative Content Mining (ICM), an automatic jailbreak pipeline for black-box models, revealing that current large language models can act as deeply hidden "evil doctors." Unlike previous methods, ICM requires neither complex jailbreak template construction nor elaborate question-resolution strategies; it merely leverages the model's own responses to mine harmful knowledge hidden inside the model. Starting from a simple harmful question, our method mines and refines content from each turn of the model's responses, gradually guiding the model to generate and answer increasingly complex harmful questions that easily bypass the defense mechanisms of large language models. ICM achieves high attack success rates (ASR) with high query efficiency on many black-box models, both open-source and closed-source: 84\% on Qwen-Turbo, 88\% on ERNIE-4.0-Turbo, 88\% on GPT-4-Turbo, and 92\% on Qwen-2.5-7B within 10 queries. It surpasses previous automatic, black-box, and interpretable jailbreak pipelines and offers a new perspective for future jailbreak research.
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: prompting, security and privacy, red teaming
Contribution Types: Theory
Languages Studied: English, Chinese
Submission Number: 5629