Keywords: Large Language Model
Abstract: Large Language Models (LLMs), especially open-source LLMs, have achieved remarkable success across various critical domains. However, their open nature also inadvertently introduces significant security risks, particularly through embedding-space poisoning. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood, despite the potential severity of such attacks.
We propose **ETTA (Embedding Transformation Toxicity Attenuation)**, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs, ETTA achieves a high average attack success rate of **88.61%**, outperforming the best baseline by **11.34%**, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and the need for embedding-aware defenses.
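To make the core idea concrete, below is a minimal sketch of attenuating selected embedding dimensions with a linear transformation. The abstract does not specify how ETTA constructs its transformation, so the diagonal scaling matrix, the pre-identified `toxic_dims` indices, and the attenuation factor `alpha` are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch: damp a set of "toxicity-sensitive" embedding
# dimensions with a diagonal linear map applied before the model forward pass.
# The dimension indices and attenuation factor here are placeholders.
import torch


def build_attenuation_matrix(hidden_dim: int,
                             toxic_dims: list[int],
                             alpha: float = 0.1) -> torch.Tensor:
    """Diagonal linear transformation that scales selected dimensions by alpha."""
    scale = torch.ones(hidden_dim)
    scale[toxic_dims] = alpha  # attenuate the assumed toxicity-sensitive axes
    return torch.diag(scale)


def transform_embeddings(embeddings: torch.Tensor,
                         transform: torch.Tensor) -> torch.Tensor:
    """Apply the linear map to a (batch, seq_len, hidden_dim) embedding tensor."""
    return embeddings @ transform


# Toy usage with made-up sizes and dimension indices.
hidden_dim = 8
emb = torch.randn(1, 4, hidden_dim)  # (batch, seq_len, hidden_dim)
T = build_attenuation_matrix(hidden_dim, toxic_dims=[2, 5], alpha=0.1)
out = transform_embeddings(emb, T)
print(out.shape)  # torch.Size([1, 4, 8])
```

In this sketch the transformation is purely linear and parameter-free with respect to the model, which matches the abstract's claim that no fine-tuning or training data access is required; how the sensitive dimensions are identified is left to the paper itself.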
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8949