Keywords: security/privacy, large language models, backdoor attack, jailbreak, prompt stealing attack
Abstract: The widespread deployment of large language models (LLMs) raises severe security and privacy concerns. Existing attacks, however, mainly target the model and its input/output space, while vulnerabilities in the token-embedding layer remain underexplored. In this work, we target the token-embedding layer and propose SOS, an adaptable framework that operates without requiring clean data or modifying the core transformer block weights, ensuring minimal computational overhead and preserving model utility. Experiments demonstrate the efficacy of SOS across different attack objectives, including backdoor, jailbreak, and prompt stealing attacks. Furthermore, we explore its dual potential to safeguard copyrighted content and protect the intellectual property of LLMs. Our work highlights both vulnerabilities and opportunities in securing LLMs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3714