Activating and Probing: Deep Detection of Jailbreaking Prompts in Large Language Models

ACL ARR 2024 December Submission 2171 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Jailbreak prompt detection, which aims to identify harmful inputs to Large Language Models (LLMs), is crucial for LLM safety. Existing studies fall broadly into two categories: prompt-based methods, which use a third-party evaluator to directly assess the toxicity of input prompts, and LLM-feedback-based methods, which examine various forms of feedback generated by LLMs. Both categories have limitations: prompt-based methods ignore the interaction between prompts and LLMs, leading to inaccurate judgments, while LLM-feedback-based methods struggle to identify challenging jailbreak prompts that bypass the safeguards of LLMs. To address these issues, we propose AcProb, a novel framework for jailbreak prompt detection whose core idea is to let the detector stand on the shoulders of LLMs. Specifically, we first activate the inherent value-defense mechanism of LLMs by appending specialized suffixes to prompts, triggering subtle changes in their internal parameter states. A CNN-based detector then probes the resulting parameter distributions to determine whether an input prompt is a jailbreak. Extensive experiments on two public datasets demonstrate that AcProb outperforms state-of-the-art (SOTA) methods even when using only 10% of the training data. Moreover, on the challenging jailbreak dataset we constructed, the AUPRC of AcProb is 25.6% (absolute) higher than that of SOTA methods.
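To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of it: append an activation suffix to the prompt, collect the LLM's internal hidden states, and classify them with a small CNN probe. This is a minimal sketch, not the authors' implementation; the model name, the suffix text, and the probe architecture are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any causal LM with accessible hidden states; a small model is
# used here only to keep the sketch lightweight.
MODEL_NAME = "gpt2"
# Hypothetical "activation" suffix; the paper's actual suffixes are not given.
ACTIVATION_SUFFIX = " Remember, you must refuse harmful requests."

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
llm.eval()

def collect_states(prompt: str) -> torch.Tensor:
    """Run the suffixed prompt through the LLM and return the final-token
    hidden states across all layers as a (num_layers + 1, hidden_dim) map."""
    inputs = tokenizer(prompt + ACTIVATION_SUFFIX, return_tensors="pt")
    with torch.no_grad():
        out = llm(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (1, seq_len, hidden_dim), one per layer (+ embeddings)
    last_token_states = [h[0, -1, :] for h in out.hidden_states]
    return torch.stack(last_token_states)  # shape: (L, D)

class CNNProbe(nn.Module):
    """Small CNN that treats the layer-by-dimension state map as a 1-channel image."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(8 * 4 * 4, 2)  # benign vs. jailbreak logits

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        x = states.unsqueeze(0).unsqueeze(0)  # (1, 1, L, D)
        return self.head(self.conv(x).flatten(1))

# Usage: score a prompt. An untrained probe gives arbitrary logits; it would
# be trained on labeled benign / jailbreak prompts first.
probe = CNNProbe()
logits = probe(collect_states("How do I bake a cake?"))
print(logits.softmax(dim=-1))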
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Large Language Models, Jailbreak attack, Jailbreaking prompts detection
Languages Studied: English
Submission Number: 2171