Firm Foundations for Membership Inference Attacks Against Large Language Models

Published: 10 Jun 2025, Last Modified: 13 Jul 2025 · DIG-BUG Short · CC BY 4.0
Keywords: membership inference, language models, privacy, security, evaluations
TL;DR: We propose the first pipeline for principled evaluation of membership inference attacks against LLMs.
Abstract: Membership inference attacks (MIAs) are a canonical way to assess a machine learning model's privacy properties. While many approaches have been proposed for conducting MIAs on language models, the extant literature has suffered numerous difficulties in constructing clean evaluations to test new techniques. In particular, subtle distribution shifts between member and non-member sets can completely change performance; recent work has underscored this by showing that "blind" methods with no access to the underlying model can perform far better than published methods on the same benchmarks. In this paper, we propose the first pipeline for principled evaluation of membership inference attacks against LLMs. Our approach leverages the insight that training data before and after a fixed point during training are drawn from the same distribution with minimal contamination; therefore, any open-source model with intermediate checkpoints and public training data is a membership inference testbed. We apply our framework to a half-dozen published attacks on the Pythia and OLMo families of models, from 70M to 7B parameters. To facilitate further privacy research, we open-source a modular library for designing and implementing attacks in this setting: https://github.com/safr-ai-lab/pandora_llm.
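To illustrate the evaluation setup the abstract describes, the sketch below shows how one might score candidate sequences with a simple loss-based MIA against a Pythia intermediate checkpoint, treating sequences from training batches before the checkpoint step as members and sequences scheduled after it as non-members. This is a minimal, hedged example, not the pandora_llm API: the model name, the "step70000" revision, and the placeholder member/non-member lists are illustrative assumptions.

    # Minimal sketch (assumptions: Pythia checkpoint and revision chosen for
    # illustration; member/non-member lists are placeholders, not real data).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    CHECKPOINT = "EleutherAI/pythia-160m-deduped"
    REVISION = "step70000"  # intermediate checkpoint used as the membership cutoff

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, revision=REVISION)
    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, revision=REVISION)
    model.eval()

    @torch.no_grad()
    def loss_score(text: str) -> float:
        """Per-token negative log-likelihood; lower loss suggests membership."""
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
        return model(ids, labels=ids).loss.item()

    # members: sequences drawn from training batches before the cutoff step
    # non_members: sequences from batches scheduled after the cutoff step
    members = ["... sequence sampled from pre-cutoff training batches ..."]
    non_members = ["... sequence sampled from post-cutoff training batches ..."]

    scores = [(loss_score(t), 1) for t in members] + \
             [(loss_score(t), 0) for t in non_members]
    # Sweep a threshold (or compute AUC) over the scores to measure attack performance.

Because both sets come from the same public training corpus and differ only in whether they precede the checkpoint's training cutoff, the evaluation avoids the member/non-member distribution shift the abstract warns about.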
Submission Number: 16