Keywords: security/privacy, large language models, backdoor attack, jailbreak, prompt stealing attack
Abstract: The widespread deployment of large language models (LLMs) raises severe security and privacy concerns. Existing attacks, however, mainly target the model and its input/output space, while vulnerabilities in the token-embedding layer remain underexplored. In this work, we target the token-embedding layer and propose SOS, an adaptable framework that operates without requiring clean data or modifying the core transformer block weights, ensuring minimal computational overhead and preserving model utility. Experiments demonstrate the efficacy of SOS across different attack objectives, including backdoor, jailbreak, and prompt stealing attacks. Furthermore, we explore its dual potential to safeguard copyrighted content and protect the intellectual property of LLMs. Our work highlights both vulnerabilities and opportunities in securing LLMs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3714