Parameter Drift as a Signal for Membership Inference in Overfit-Tuned LLMs

Takuto Kitamura, Yu Suzuki

Published: 2025, Last Modified: 28 May 2026 | DaWaK 2025 | CC BY-SA 4.0
Abstract: We propose a novel white-box membership inference attack (MIA) for large language models (LLMs) that leverages internal parameter dynamics to determine whether a given text sample was included in a model’s pre-training data. Prior MIA approaches rely primarily on input-output behavior and, because LLM generation is probabilistic, struggle to distinguish memorized samples from semantically similar but unseen ones. To address this challenge, we introduce a method based on parameter drift, defined as the Euclidean distance between the entire set of model parameters before and after continual pre-training on a single input. Our hypothesis is that inputs the model has already been trained on induce minimal parameter changes, while unseen inputs require larger updates to the model. Notably, we find that even semantically similar inputs yield distinct drift magnitudes, enabling more precise membership inference. We validate our approach on multiple LLMs, including Pythia and LLaMA-2, and show that it consistently outperforms existing MIA baselines such as Min-K% Prob and SaMIA*zlib under various evaluation settings. Furthermore, we demonstrate that restricting the drift computation to parameters with high drift further improves inference accuracy, achieving state-of-the-art results on benchmark datasets.
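The drift signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear model in place of an LLM, a single squared-error gradient step in place of continual pre-training, and hypothetical helper names (`sgd_step`, `parameter_drift`, `top_k`). It only shows that a sample the model already fits yields a smaller Euclidean drift than one it does not, and how the drift could be restricted to the most-changed parameters.

```python
import math

def sgd_step(params, x, y, lr=0.1):
    """One gradient step on squared error for a toy linear model (assumption:
    stands in for continual pre-training on a single input)."""
    pred = sum(p * xi for p, xi in zip(params, x))
    err = pred - y
    # Gradient of 0.5 * err**2 with respect to parameter i is err * x_i.
    return [p - lr * err * xi for p, xi in zip(params, x)]

def parameter_drift(before, after, top_k=None):
    """Euclidean distance between two parameter vectors.
    If top_k is given, only the top_k largest per-parameter changes are used
    (a hypothetical reading of 'focusing on parameters with high drift')."""
    deltas = sorted((abs(a - b) for a, b in zip(after, before)), reverse=True)
    if top_k is not None:
        deltas = deltas[:top_k]
    return math.sqrt(sum(d * d for d in deltas))

params = [2.0, -1.0, 0.5]
seen = ([1.0, 2.0, 0.0], 0.0)    # the model already predicts 0.0 for this input
unseen = ([1.0, 2.0, 0.0], 3.0)  # same input, but a target the model does not fit

drift_seen = parameter_drift(params, sgd_step(params, *seen))
drift_unseen = parameter_drift(params, sgd_step(params, *unseen))
drift_unseen_top1 = parameter_drift(params, sgd_step(params, *unseen), top_k=1)

print(drift_seen, drift_unseen, drift_unseen_top1)
```

Under this reading, a lower drift suggests membership. Any decision threshold would have to be calibrated on data with known membership; the abstract does not specify how.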