Keywords: Large Language Model, LLM Security
Abstract: Large Language Models (LLMs) are increasingly distributed and deployed through public platforms such as Hugging Face. While these platforms provide basic security scanning, they often overlook subtle manipulations within the embedding layer that can lead to harmful behaviors during inference. We observe a Semantic Shift phenomenon in embedding perturbations, which exposes a potential security threat. Building on an analysis of this phenomenon, we propose Search-based Embedding Poisoning (SEP), a practical, systematic, and model-agnostic framework that bypasses model safety alignment by introducing carefully chosen perturbations into the embeddings of high-risk tokens. SEP employs a heuristic search, guided by linear semantic shift, to identify subtle perturbations, dynamically adjusting the search process according to model outputs until safety alignment is evaded. SEP achieves an average attack success rate of 96.43\% while preserving benign functionality and evading conventional detection mechanisms. Our findings highlight an overlooked attack surface in the deployment pipeline and call for embedding-level integrity checks as a core component of future LLM defense strategies.
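The abstract's closing call for embedding-level integrity checks suggests a simple defensive baseline: checksum the embedding matrix of a deployed checkpoint against a trusted reference so that any row-level tampering is detectable. The sketch below is a minimal illustration assuming a Hugging Face `transformers`-style model; the model id and the `trusted_digest` placeholder are hypothetical, and this is not a mechanism taken from the paper itself.

```python
# Minimal sketch of an embedding-level integrity check of the kind the abstract calls for.
# The model id ("gpt2") and the idea of comparing against a published reference digest are
# illustrative assumptions, not the authors' proposed defense.
import hashlib

import torch
from transformers import AutoModelForCausalLM


def embedding_digest(model: torch.nn.Module) -> str:
    """Hash the input-embedding matrix so any perturbed row changes the digest."""
    weights = model.get_input_embeddings().weight.detach().cpu().contiguous()
    return hashlib.sha256(weights.numpy().tobytes()).hexdigest()


# Compare a locally deployed checkpoint against a digest recorded from a trusted copy.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model id
trusted_digest = "..."  # would be published alongside the trusted release (hypothetical)

if embedding_digest(model) != trusted_digest:
    print("Embedding matrix differs from the trusted release: possible embedding poisoning.")
```

A full deployment check would cover all weight tensors, but hashing the input embeddings alone already flags the perturbations of high-risk token rows that SEP-style attacks rely on.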
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9128