Abstract: As the capabilities of large language models (LLMs) improve, their safety has garnered increasing attention. In this paper, we introduce Iterative Content Mining (ICM), an automatic jailbreak pipeline for black-box models, revealing that current large language models can act as deeply hidden "evil doctors." Unlike previous methods, ICM requires neither complex jailbreak template construction nor elaborate question-resolution strategies; it merely leverages the model's own responses to mine harmful knowledge hidden inside the model. Starting from a simple harmful question, our method mines and refines content from each turn of the model's responses, gradually guiding the model to generate and answer increasingly complex harmful questions that easily bypass the defense mechanisms of large language models. ICM achieves high attack success rates (ASR) with high query efficiency on many black-box models, both open-source and closed-source: 84\% on Qwen-Turbo, 88\% on ERNIE-4.0-Turbo, 88\% on GPT-4-Turbo, and 92\% on Qwen-2.5-7B within 10 queries. It surpasses previous automatic, black-box, and interpretable jailbreak pipelines and offers a new perspective for future jailbreak research.
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: prompting, security and privacy, red teaming
Contribution Types: Theory
Languages Studied: English, Chinese
Submission Number: 5629