Effective Prompt Extraction from Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024, COLM, CC BY 4.0
Research Area: Safety
Keywords: prompt extraction, safety
TL;DR: Large language model prompts, often kept secret, can be extracted by adversaries using text-based attacks. Experiments across several prompt sources and models demonstrate this, indicating vulnerabilities in real systems such as ChatGPT and Claude.
Abstract: The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user’s query guides the model’s output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold on marketplaces. However, anecdotal reports have shown adversarial users employing prompt extraction attacks to recover these prompts. In this paper, we present a framework for systematically measuring the effectiveness of these attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination. Prompt extraction from real systems such as Claude 3 and ChatGPT further suggests that system prompts can be revealed by an adversary despite existing defenses in place.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 972