Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

Published: 09 Oct 2024, Last Modified: 03 Jan 2025
Red Teaming GenAI Workshop @ NeurIPS'24 Poster
License: CC BY 4.0
Keywords: privacy, security, memorization
TL;DR: We propose a new method to analyze memorization in large language models (LLMs) using instruction-based prompts. Results show a substantial increase in training data overlap, suggesting potential vulnerabilities to automated attacks.
Abstract: In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent than is revealed by prompting the target model with the training data directly, which is the dominant approach to quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data, to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to emit training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. We analyze our attack in two settings: a practical setting with limited access to the sequence, excluding the suffix, and an upper-bound setting with full sequence access but a penalty discouraging direct solutions, which demonstrates the empirical ceiling of the attack's power. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more so, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs can open a new avenue of automated attacks that we should further study and explore.
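The iterative rejection-sampling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `attacker` and `victim` callables stand in for LLM queries, `token_overlap` is a crude set-based stand-in for whatever overlap metric the paper uses, and the threshold and iteration count are hypothetical.

```python
import random

def token_overlap(reference: str, candidate: str) -> float:
    """Fraction of tokens in `candidate` that also appear in `reference`
    (a crude proxy for the paper's overlap metric)."""
    ref_tokens, cand_tokens = set(reference.split()), candidate.split()
    if not cand_tokens:
        return 0.0
    return sum(t in ref_tokens for t in cand_tokens) / len(cand_tokens)

def optimize_prompt(attacker, victim, train_seq: str,
                    n_iters: int = 50, max_prompt_overlap: float = 0.2,
                    seed: int = 0):
    """Iterative rejection sampling: reject candidate prompts that overlap
    too much with the training sequence (so the prompt does not hand the
    model the solution), and among the survivors keep the prompt whose
    victim output overlaps most with the training data."""
    rng = random.Random(seed)
    best_prompt, best_score = None, -1.0
    for _ in range(n_iters):
        prompt = attacker(train_seq, rng)          # attacker LLM proposes an instruction
        if token_overlap(train_seq, prompt) > max_prompt_overlap:
            continue                               # reject: prompt leaks the target text
        output = victim(prompt)                    # query the victim model
        score = token_overlap(train_seq, output)   # overlap of output with training data
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

In the paper's upper-bound setting, the hard rejection threshold would instead act as a penalty term on prompts that restate the sequence directly.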
Serve As Reviewer: kassem6@uwindsor.ca
Submission Number: 27