I2HCM: A Pluggable Jailbreak Pipeline for Enhancing Harmful Questions

ACL ARR 2025 May Submission 6174 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: As the capabilities of large language models (LLMs) improve, their safety has garnered increasing attention. In this paper, we introduce Iterative Internal Harmful Content Mining (I$^2$HCM), an automatic, pluggable jailbreak pipeline that enhances harmful questions against black-box models, revealing that existing large language models can act as deeply hidden "evil doctors." Unlike previous methods, I$^2$HCM requires neither complex jailbreak-template construction nor question-resolution strategies; it merely leverages the model's own responses to mine the harmful knowledge stored inside the model. Starting from a simple harmful question, our method mines, refines, and reuses content from each turn of the model's response, gradually guiding the model toward a more complex harmful question that easily bypasses the defense mechanisms of large language models. Our method achieves high attack success rates (ASR) with high efficiency on many black-box models across different attack settings. It can not only serve as a standalone jailbreak pipeline but also be readily embedded into many existing jailbreak pipelines, and it offers a new perspective on constructing safety-alignment datasets.
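The abstract describes an iterative refinement loop: query the model, mine content from its response, fold that content back into the question, and repeat. Below is a minimal Python sketch of that loop structure only; the helper names (query_model, mine_content, refine_question) and all their bodies are hypothetical stand-ins, since the paper's actual mining and refinement logic is not given here.

```python
# Minimal, hypothetical sketch of the iterative loop the abstract describes.
# All helpers are illustrative placeholders, NOT the authors' implementation.

def query_model(question: str) -> str:
    """Stand-in for one turn against a black-box chat model (e.g., an API call)."""
    return f"[model response to: {question}]"

def mine_content(response: str) -> str:
    """Extract salient fragments from the turn's response (details unspecified)."""
    return response

def refine_question(question: str, mined: str) -> str:
    """Fold mined fragments back into the question (details unspecified)."""
    return f"{question} | context: {mined}"

def i2hcm(seed_question: str, max_turns: int = 5) -> str:
    """Iteratively mine, refine, and reuse each turn's response so a simple
    seed question gradually grows more complex, per the abstract."""
    question = seed_question
    for _ in range(max_turns):
        response = query_model(question)
        question = refine_question(question, mine_content(response))
    return question

# Usage: i2hcm("seed question") returns the final, iteratively grown question.
```

Because the loop only consumes and produces strings, it is "pluggable" in the sense the abstract claims: the final question can be handed to any downstream jailbreak pipeline in place of the original seed.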
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: prompting, safety and alignment, red teaming, security and privacy
Contribution Types: Theory
Languages Studied: English
Submission Number: 6174