Private Investigator: Extracting Personally Identifiable Information from Large Language Models Using Optimized Prompts
Abstract: Recent studies on training data extraction attacks have demonstrated significant threats to the language model ecosystem.
In a typical machine learning deployment scenario where a
pre-trained language model is fine-tuned on users’ private
data, an adversary may attempt to extract personally identifiable information (PII) memorized by the fine-tuned model.
Prior work has demonstrated this privacy risk by inducing a
model to output PII in response to handcrafted or outsourced
prompts. However, little attention has been paid to how a
smart adversary would design optimal prompts for successful
PII extraction.
In this work, we address this knowledge gap. We propose
Private Investigator, an attack framework designed to optimize
prompts for querying a target language model to extract PII
used in its fine-tuning process. We propose a new prompt generation method that crafts promising prompts, i.e., prompts that
induce the target language model to emit as many PII items
as possible by exploring diverse contexts. Private Investigator
then exploits these generated prompts to conduct extraction
attacks. To this end, we develop a prompt selection strategy
that prioritizes the most promising prompts for successful
PII extraction, taking full advantage of each extraction attack
opportunity. In evaluation, we demonstrate that Private Investigator extracts up to 1,254 more email addresses, 634 more
phone numbers, and 5,087 more personal names, outperforming existing attacks in extracting PII items.