Keywords: Sense-Level Vulnerabilities, Large Language Models, Polysemy
Abstract: Semantic sensitivity, the capacity of large language models (LLMs) to discern fine-grained meaning, is a double-edged sword: it enables advanced reasoning, but it also carries security risks that remain underexplored. Prior studies of LLM vulnerabilities focus mainly on attacks triggered by explicit lexical or structural patterns, implicitly assuming that malicious activation is surface-identifiable. We challenge this assumption by revealing polysemy as a new and stealthy threat surface in which specific word senses serve as covert triggers. Such a trigger activates malicious behavior only under its target sense and remains inert otherwise, which fundamentally differs from prior attacks and evades conventional defenses designed for surface-level cues. To systematically investigate this risk, we introduce the Sense-Aware Backdoor attack (SAB), a model editing framework that combines contrastive learning with orthogonal projection-based editing to isolate a discriminative sense subspace and confine malicious behavior within it, achieving strict activation selectivity with limited data. Extensive experiments across four benchmarks show that SAB achieves a high attack success rate under the target sense while exhibiting minimal to zero activation on non-target senses. Our findings expose a previously unrecognized blind spot in LLM safety and highlight the need for sense-aware auditing and defense mechanisms.
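To make the mechanism described in the abstract concrete, below is a minimal, hypothetical Python sketch of the two ingredients SAB names: a contrastive objective that separates target-sense from non-target-sense embeddings of the trigger word, and an orthogonal projection that confines a weight edit to the resulting sense subspace. All function names, tensor shapes, and the particular subspace construction (SVD over centered target-sense embeddings plus the mean-difference direction) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def contrastive_sense_loss(target, nontarget, temperature=0.1):
    """InfoNCE-style loss: pull target-sense embeddings of the trigger word
    together and push non-target-sense embeddings away. target: (n, d),
    nontarget: (m, d) contextual embeddings."""
    target = F.normalize(target, dim=-1)
    nontarget = F.normalize(nontarget, dim=-1)
    anchor = F.normalize(target.mean(dim=0, keepdim=True), dim=-1)  # (1, d)
    pos = torch.exp(anchor @ target.T / temperature)      # similarity to positives
    neg = torch.exp(anchor @ nontarget.T / temperature)   # similarity to negatives
    return -torch.log(pos.sum() / (pos.sum() + neg.sum()))

def sense_subspace(target, nontarget, k=4):
    """Estimate a k-dimensional discriminative sense subspace from the
    mean-difference direction and the top principal directions of the
    centered target-sense embeddings. Returns an orthonormal basis (k, d)."""
    diff = target.mean(0) - nontarget.mean(0)
    centered = target - target.mean(0)
    stacked = torch.cat([diff[None, :], centered])
    return torch.linalg.svd(stacked, full_matrices=False).Vh[:k]

def project_edit(delta_w, basis):
    """Confine a weight edit to the sense subspace: components of the edit
    acting on input directions orthogonal to the subspace are removed, so
    the edit stays (approximately) inert for non-target-sense activations."""
    P = basis.T @ basis        # (d, d) orthogonal projector onto the subspace
    return delta_w @ P

# Toy usage with random stand-ins for trigger-word embeddings:
d = 64
tgt, non = torch.randn(32, d), torch.randn(32, d)
loss = contrastive_sense_loss(tgt, non)       # drives sense separation during training
B = sense_subspace(tgt, non, k=4)
delta_confined = project_edit(torch.randn(d, d) * 0.01, B)

The design intuition, under these assumptions: because the projected edit only acts on activation components inside the sense subspace, inputs using the trigger word in non-target senses, whose representations lie mostly outside that subspace, pass through the edited weights largely unchanged, which is the activation selectivity the abstract claims.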
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: adversarial attacks, model editing
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 8080