Keywords: Large Language Model
Abstract: Large Language Models (LLMs), especially open-source LLMs, have achieved remarkable success across various critical domains. However, their open nature also inadvertently introduces significant security risks, particularly through embedding-space poisoning. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood, despite the potential severity of such attacks.
We propose **ETTA (Embedding Transformation Toxicity Attenuation)**, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs, ETTA achieves a high average attack success rate of **88.61%**, outperforming the best baseline by **11.34%**, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and the need for embedding-aware defenses.
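To make the core idea concrete, below is a minimal sketch of attenuating selected embedding dimensions with a linear transformation. The abstract does not specify how ETTA constructs its transformation, so the diagonal scaling matrix, the pre-identified `toxic_dims` indices, and the attenuation factor `alpha` are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch: damp a set of "toxicity-sensitive" embedding
# dimensions with a diagonal linear map applied before the model forward pass.
# The dimension indices and attenuation factor here are placeholders.
import torch


def build_attenuation_matrix(hidden_dim: int,
                             toxic_dims: list[int],
                             alpha: float = 0.1) -> torch.Tensor:
    """Diagonal linear transformation that scales selected dimensions by alpha."""
    scale = torch.ones(hidden_dim)
    scale[toxic_dims] = alpha  # attenuate the assumed toxicity-sensitive axes
    return torch.diag(scale)


def transform_embeddings(embeddings: torch.Tensor,
                         transform: torch.Tensor) -> torch.Tensor:
    """Apply the linear map to a (batch, seq_len, hidden_dim) embedding tensor."""
    return embeddings @ transform


# Toy usage with made-up sizes and dimension indices.
hidden_dim = 8
emb = torch.randn(1, 4, hidden_dim)  # (batch, seq_len, hidden_dim)
T = build_attenuation_matrix(hidden_dim, toxic_dims=[2, 5], alpha=0.1)
out = transform_embeddings(emb, T)
print(out.shape)  # torch.Size([1, 4, 8])
```

In this sketch the transformation is purely linear and parameter-free with respect to the model, which matches the abstract's claim that no fine-tuning or training data access is required; how the sensitive dimensions are identified is left to the paper itself.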
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8949