Keywords: Prompt Extraction, Large Language Models
Abstract: The drastic increase in large language model (LLM) parameters has given rise to a new research direction of fine-tuning-free downstream customization by designing prompts. While these prompt-based agents play an important role in many businesses, growing concerns have emerged about prompt leakage, which undermines the intellectual property of these services and enables downstream attacks. In this paper, we analyze the underlying mechanisms of prompt leakage. By exploring the scaling laws in prompt extraction, we analyze key attributes that influence prompt extraction, including model size, prompt length, and prompt type. Besides, we propose two hypotheses to explain how LLMs expose their prompts. The first attributes leakage to perplexity, i.e., the familiarity of LLMs with the text, whereas the second is based on straightforward token translation paths in the attention matrices. To defend against such threats, we investigate whether alignment can mitigate prompt extraction. We find that current LLMs, even those with safety alignment, are highly vulnerable to prompt extraction attacks, even under the most straightforward user attacks. Therefore, inspired by our findings, we propose several defense strategies, which achieve an almost 71.0% drop in the prompt extraction rate. Our source code is available at https://anonymous.4open.science/r/PromptExtractionEval-C6B7/.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: adversarial attacks, knowledge tracing
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3980