Keywords: Large Language Model, LLM Security
Abstract: Large Language Models (LLMs) are increasingly distributed and deployed through public platforms such as Hugging Face. While these platforms provide basic security scanning, they often overlook subtle manipulations within the embedding layer that can lead to harmful behaviors during inference. We observe a Semantic Shift phenomenon in embedding perturbations, which exposes a potential security threat. Building on an analysis of this phenomenon, we propose Search-based Embedding Poisoning (SEP), a practical, systematic, and model-agnostic framework that bypasses model safety alignment by introducing carefully chosen perturbations into the embeddings of high-risk tokens. SEP employs a heuristic search, guided by linear semantic shift, to identify subtle perturbations, dynamically adjusting the search process according to model outputs until safety alignment is evaded. SEP achieves an average attack success rate of 96.43\% while preserving benign functionality and evading conventional detection mechanisms. Our findings highlight an overlooked attack surface in the deployment pipeline and call for embedding-level integrity checks as a core component of future LLM defense strategies.
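The abstract's closing call for embedding-level integrity checks suggests a simple defensive baseline: checksum the embedding matrix of a deployed checkpoint against a trusted reference so that any row-level tampering is detectable. The sketch below is a minimal illustration assuming a Hugging Face `transformers`-style model; the model id and the `trusted_digest` placeholder are hypothetical, and this is not a mechanism taken from the paper itself.

```python
# Minimal sketch of an embedding-level integrity check of the kind the abstract calls for.
# The model id ("gpt2") and the idea of comparing against a published reference digest are
# illustrative assumptions, not the authors' proposed defense.
import hashlib

import torch
from transformers import AutoModelForCausalLM


def embedding_digest(model: torch.nn.Module) -> str:
    """Hash the input-embedding matrix so any perturbed row changes the digest."""
    weights = model.get_input_embeddings().weight.detach().cpu().contiguous()
    return hashlib.sha256(weights.numpy().tobytes()).hexdigest()


# Compare a locally deployed checkpoint against a digest recorded from a trusted copy.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model id
trusted_digest = "..."  # would be published alongside the trusted release (hypothetical)

if embedding_digest(model) != trusted_digest:
    print("Embedding matrix differs from the trusted release: possible embedding poisoning.")
```

A full deployment check would cover all weight tensors, but hashing the input embeddings alone already flags the perturbations of high-risk token rows that SEP-style attacks rely on.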
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9128