Keywords: model stealing, model extraction, soft prompt tuning, soft prompts, last layer extraction, defenses, LLMs
TL;DR: We show that adversaries can extract functionally equivalent soft prompts from prompt-tuned LLMs and introduce CAP, a defense that impairs such extraction while preserving task performance.
Abstract: Soft prompt tuning has emerged as a powerful and automated approach for adapting large language models (LLMs) to new tasks, eliminating the need for manual prompt engineering. The practical relevance of soft prompts is underscored by their support in major toolkits and APIs such as NVIDIA NeMo and IBM Watsonx AI. However, as soft prompts encode valuable, task-specific information, they have become attractive targets for adversarial extraction. In this work, we demonstrate that attackers can extract functionally equivalent soft prompts from prompt-tuned LLMs, effectively replicating their capabilities without access to the original training data or resources. By training a dedicated inversion model, we show that such extraction generalizes, enabling recovery of soft prompts for any downstream task on the given model. To counter this threat, we introduce CAP (**C**overage-**A**ware **P**erturbation), an active defense that substantially impairs extraction while maintaining task performance for legitimate use. Our framework highlights both new risks and practical solutions, paving the way for more trustworthy deployment of adapted LLMs.
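For context on the mechanism at stake: soft prompt tuning prepends a small set of trainable embedding vectors to the frozen base model's input embeddings and optimizes only those vectors for the downstream task. The sketch below is a minimal PyTorch illustration of that idea under our own assumptions; the class name, prompt length, and dimensions are illustrative and are not taken from the submission's implementation.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends a trainable soft prompt to a frozen embedding layer's output."""

    def __init__(self, embed: nn.Embedding, prompt_len: int = 20):
        super().__init__()
        self.embed = embed  # token-embedding layer of the (frozen) base LLM
        for p in self.embed.parameters():
            p.requires_grad = False  # only the soft prompt is trained
        d_model = embed.embedding_dim
        # The soft prompt: prompt_len trainable vectors in embedding space.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok_emb = self.embed(input_ids)                     # (batch, seq, d_model)
        batch = tok_emb.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, tok_emb], dim=1)          # (batch, prompt_len + seq, d_model)

# Toy usage with a stand-in embedding table instead of a real LLM.
embed = nn.Embedding(num_embeddings=32000, embedding_dim=768)
wrapper = SoftPromptWrapper(embed, prompt_len=20)
ids = torch.randint(0, 32000, (2, 16))
print(wrapper(ids).shape)  # torch.Size([2, 36, 768])
```

Because the entire task adaptation is concentrated in these few hundred trainable parameters, recovering them is what the extraction attack targets and what CAP aims to protect.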
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1529