MoPe: Model Perturbation-based Privacy Attacks on Language Models

Jason Wang; Jeffrey Wang; Marvin Li; Seth Neel

Published: 23 Oct 2023, Last Modified: 28 Nov 2023, Venue: SoLaR Poster

Keywords: large language model, membership inference

TL;DR: We demonstrate state-of-the-art membership inference attacks based on model perturbations against a suite of LLMs.

Abstract: Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present MoPe ($\textbf{Mo}$del $\textbf{Pe}$rturbations), a new method to identify with high confidence whether a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe adds noise to the model in parameter space and measures the drop in the log-likelihood at a given point $x$, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. We compare MoPe to existing state-of-the-art loss-based attacks and to other attacks based on second-order curvature information (such as the trace of the Hessian with respect to the model input). Across language models ranging in size from $70$M to $12$B parameters, we show that MoPe is more effective than existing attacks. We also find that the loss of a point alone is insufficient to determine extractability: there are training points we can recover using our methods that have average loss. This casts some doubt on prior work that uses the loss of a point as evidence of memorization or "unlearning."
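
The perturbation statistic described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the names `mope_statistic` and `toy_loss`, the Gaussian noise scale `sigma`, and the number of perturbations are all assumptions made for illustration. It uses a quadratic loss evaluated at its minimizer, where the expected loss increase under noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ equals $\frac{\sigma^2}{2}\,\mathrm{tr}(H)$, so the link to the Hessian trace can be checked numerically.

```python
import numpy as np


def mope_statistic(loss_fn, theta, x, sigma=0.1, n_perturbations=2000, seed=0):
    """Average loss increase at x after adding Gaussian noise to theta.

    Hypothetical helper: loss_fn(theta, x) returns a scalar loss
    (a negative log-likelihood in the language-model setting).
    """
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(theta, x)
    increases = np.empty(n_perturbations)
    for i in range(n_perturbations):
        # Perturb every parameter with independent N(0, sigma^2) noise.
        noise = rng.normal(0.0, sigma, size=theta.shape)
        increases[i] = loss_fn(theta + noise, x) - base_loss
    return float(increases.mean())


def toy_loss(theta, x):
    """Toy quadratic loss whose Hessian in theta is diag(x) (an assumption)."""
    return 0.5 * float(np.sum(x * theta**2))


# theta sits at the minimizer of the toy loss, so the gradient term vanishes.
theta = np.zeros(4)
x = np.array([1.0, 2.0, 3.0, 4.0])  # diagonal of the Hessian; trace is 10

sigma = 0.1
stat = mope_statistic(toy_loss, theta, x, sigma=sigma)
expected = 0.5 * sigma**2 * x.sum()  # (sigma^2 / 2) * tr(H)
print(f"estimated: {stat:.4f}, (sigma^2/2) * trace(H): {expected:.4f}")
```

In an attack setting, a statistic of this kind would be computed per candidate text and compared against a threshold, with `loss_fn` replaced by the language model's negative log-likelihood; that thresholding step is not shown here.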

Submission Number: 91
