I2HCM: A Pluggable Jailbreak Pipeline for Enhancing Harmful Questions

ACL ARR 2025 May Submission 6174 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: As the capabilities of large language models (LLMs) improve, their safety has garnered increasing attention. In this paper, we introduce Iterative Internal Harmful Content Mining (I$^2$HCM), an automatic, pluggable jailbreak pipeline that enhances harmful questions against black-box models, revealing that existing large language models can act as deeply hidden "evil doctors." Unlike previous methods, I$^2$HCM requires neither complex jailbreak-template construction nor question-resolution strategies; it merely leverages the model's own responses to mine the harmful knowledge stored inside the model. Starting from a simple harmful question, our method mines, refines, and reuses content from each turn of the model's response, gradually guiding the model toward a more complex harmful question that easily bypasses the defense mechanisms of large language models. Our method achieves high attack success rates (ASR) with high efficiency on many black-box models across different attack settings. It can not only serve as a standalone jailbreak pipeline but also be readily embedded into many existing jailbreak pipelines, and it offers a new perspective on constructing safety-alignment datasets.
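The abstract describes an iterative refinement loop: query the model, mine content from its response, fold that content back into the question, and repeat. Below is a minimal Python sketch of that loop structure only; the helper names (query_model, mine_content, refine_question) and all their bodies are hypothetical stand-ins, since the paper's actual mining and refinement logic is not given here.

```python
# Minimal, hypothetical sketch of the iterative loop the abstract describes.
# All helpers are illustrative placeholders, NOT the authors' implementation.

def query_model(question: str) -> str:
    """Stand-in for one turn against a black-box chat model (e.g., an API call)."""
    return f"[model response to: {question}]"

def mine_content(response: str) -> str:
    """Extract salient fragments from the turn's response (details unspecified)."""
    return response

def refine_question(question: str, mined: str) -> str:
    """Fold mined fragments back into the question (details unspecified)."""
    return f"{question} | context: {mined}"

def i2hcm(seed_question: str, max_turns: int = 5) -> str:
    """Iteratively mine, refine, and reuse each turn's response so a simple
    seed question gradually grows more complex, per the abstract."""
    question = seed_question
    for _ in range(max_turns):
        response = query_model(question)
        question = refine_question(question, mine_content(response))
    return question

# Usage: i2hcm("seed question") returns the final, iteratively grown question.
```

Because the loop only consumes and produces strings, it is "pluggable" in the sense the abstract claims: the final question can be handed to any downstream jailbreak pipeline in place of the original seed.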
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: prompting, safety and alignment, red teaming, security and privacy
Contribution Types: Theory
Languages Studied: English
Submission Number: 6174