SafeVacuo: Jailbreaking Open-Source LLMs via Activation Perturbations

ACL ARR 2025 February Submission7054 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Open-source large language models (LLMs) are increasingly narrowing the performance gap with proprietary LLMs, driving a surge in both their popularity and their applications. To mitigate misuse, substantial safety alignment effort is invested before model release. However, even carefully aligned LLMs remain vulnerable to various jailbreak attacks, which may be launched through malicious adversarial prompts or altered decoding strategies. These attacks exploit the white-box nature of open-source LLMs to achieve greater attack capability at lower computational cost. In this paper, we uncover a novel safety vulnerability that has not yet been exploited by existing white-box jailbreak methods: injecting perturbations into the activations of an LLM can undermine its safety alignment. Building on this insight, we propose a new jailbreak attack based on activation perturbations, which optimizes the positions at which noise is injected without degrading the perplexity of the victim LLM. A malicious user then only needs to inject random noise at the optimized positions, at minimal computational cost, to induce the model to produce high-quality yet harmful outputs. Experiments conducted across 10 state-of-the-art open-source LLMs show that this approach achieves higher attack success rates than previous methods while preserving model utility. Further analysis indicates that targeted activation perturbations can effectively bypass the safety measures of aligned models, revealing critical limitations of current safety alignment strategies. The code for this work is available at https://anonymous.4open.science/r/acttacker.
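To make the core idea of activation perturbation concrete, the following is a minimal sketch of how random noise could be injected into the hidden states of one decoder layer of an open-source LLM using a standard PyTorch forward hook. It is not the authors' method: the model name, the perturbed layer index (`LAYER_IDX`), and the noise scale (`NOISE_SCALE`) are placeholder assumptions, and the paper's key contribution, optimizing which positions receive noise while keeping perplexity intact, is not reproduced here.

```python
# Illustrative sketch only: add Gaussian noise to the output of one decoder
# layer via a forward hook. Layer index and noise scale are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # any aligned open-source LLM
LAYER_IDX = 15       # hypothetical layer to perturb
NOISE_SCALE = 0.1    # hypothetical noise magnitude

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def add_noise(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    noisy = hidden + NOISE_SCALE * torch.randn_like(hidden)
    return (noisy,) + output[1:] if isinstance(output, tuple) else noisy

# Register the hook on a single decoder layer (Llama-style module path).
handle = model.model.layers[LAYER_IDX].register_forward_hook(add_noise)

prompt = "Explain your safety guidelines."  # benign probe prompt
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unperturbed model
```

In this sketch the noise is applied uniformly to every token position in the chosen layer; the attack described in the abstract instead selects specific positions so that fluency (perplexity) is preserved while safety behavior is disrupted.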
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: red teaming, security, robustness, ethical considerations in NLP applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7054