Keywords: Soft prompting, LLMs, Interpretability, Prompting
TL;DR: We show that learnt soft prompts differ greatly in the model embedding space from natural tokens and argue this leads to corresponding safety concerns.
Abstract: Prompt tuning, or "soft prompting," replaces text prompts to generative models with learned embeddings (i.e., vectors) and is used as a parameter-efficient alternative to fine-tuning. Prior work suggests analyzing soft prompts by interpreting them as natural language prompts. However, we find that soft prompts occupy regions of the embedding space that are distinct from those containing natural language, so such direct comparisons may be misleading. We argue that because soft prompts are currently uninterpretable, they could expose LLMs to malicious manipulation during deployment.
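To make concrete the idea of interpreting a soft prompt as natural language, the following is a minimal sketch of a nearest-token projection: each soft-prompt vector is mapped to the vocabulary embedding with the highest cosine similarity. It is an illustration only, not the method used in this work. The embeddings are random toy stand-ins rather than weights from any real model, and the names `cosine`, `nearest_token`, `vocab`, and `soft_prompt` are hypothetical. The shifted, wider distribution used for `soft_prompt` is an assumption chosen to mimic vectors that lie away from the token embeddings.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_token(vec, vocab_embeddings):
    """Return (token_index, similarity) of the closest vocabulary embedding."""
    best_idx, best_sim = -1, -2.0
    for idx, emb in enumerate(vocab_embeddings):
        sim = cosine(vec, emb)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx, best_sim


rng = random.Random(0)
dim, vocab_size, prompt_len = 16, 50, 4

# Toy stand-ins (assumption): random vectors, not real model weights.
vocab = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
# Hypothetical "learned" prompt drawn from a shifted, wider distribution.
soft_prompt = [[rng.gauss(0, 5) + 3 for _ in range(dim)] for _ in range(prompt_len)]

# Project every soft-prompt vector onto its nearest vocabulary token.
projection = [nearest_token(vec, vocab) for vec in soft_prompt]
for position, (token_idx, sim) in enumerate(projection):
    print(f"position {position}: nearest token {token_idx}, cosine {sim:.3f}")
```

A projection like this always returns some token, even when the similarity is low. Reporting the similarity next to the token index is one way to see when the "interpretation" is a poor match for the underlying vector.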
Submission Number: 56