Abstract: Security concerns for large language models (LLMs) have intensified, particularly regarding jailbreaking via malicious inputs. Studying new jailbreak attacks supports red teaming and helps secure LLMs. For open-source LLMs, embedding-based attacks can be highly effective. However, existing embedding-based attacks optimize only the suffix of the prompt, which adds unnecessary complexity and makes them easier to detect. We propose a novel attack method that directly manipulates the entire LLM input without separating it into a body and a suffix. Manipulating the entire input, however, tends to produce random or nonsensical, repetitive responses. To address this challenge, we propose Clip, whose main strategy is to clip each input dimension based on the mean and standard deviation of the model vocabulary during model inference. Experiments show that Clip improves the attack success rate (ASR) of continuous embedding attacks with full LLM inputs from 62% to 83% for LLaMa and from 38% to 83% for Vicuna.
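The clipping strategy described in the abstract can be illustrated with a minimal sketch. The abstract states only that each input dimension is clipped based on the mean and standard deviation of the model vocabulary; the bound used here (mean plus or minus `k` standard deviations), the multiplier `k`, the use of the population standard deviation, and the function names `vocab_stats` and `clip_embeddings` are all assumptions made for illustration, not the authors' implementation. The toy vocabulary is likewise hypothetical.

```python
from statistics import mean, pstdev


def vocab_stats(vocab_embeddings):
    """Per-dimension mean and population std over a vocabulary embedding table."""
    dims = list(zip(*vocab_embeddings))  # transpose: one tuple per dimension
    return [mean(d) for d in dims], [pstdev(d) for d in dims]


def clip_embeddings(inputs, mu, sigma, k=1.0):
    """Clip each dimension of each input vector to [mu - k*sigma, mu + k*sigma].

    The multiplier k is a hypothetical knob; the abstract does not specify one.
    """
    return [
        [min(max(x, m - k * s), m + k * s) for x, m, s in zip(vec, mu, sigma)]
        for vec in inputs
    ]


# Toy vocabulary (assumed): 4 tokens, 2 embedding dimensions.
vocab = [[0.0, 1.0], [2.0, 1.0], [0.0, 3.0], [2.0, 3.0]]
mu, sigma = vocab_stats(vocab)

# Optimized input embeddings; the first has drifted far from the vocabulary range.
adversarial = [[10.0, -5.0], [1.0, 2.5]]
clipped = clip_embeddings(adversarial, mu, sigma, k=1.0)
```

In this toy setup the first vector is pulled back inside the per-dimension bounds, while the second, which already lies within them, is left unchanged. In an actual attack the same operation would presumably be applied to the continuous input embeddings on each optimization step.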
External IDs: dblp:conf/sp/XuLDWLSP25