Abstract: Large language models (LLMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LLMs to amplify the exposure of the original training data. Unlike prior TDE methods that mainly rely on post-hoc querying or prompt selection to elicit memorized content from a fixed model, our strategy directly alters the model's parameters to intensify its retention of the pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities obtained from the target LLM using DetectGPT. We subsequently fine-tune the LLM via reinforcement learning from human feedback (RLHF) to favor generations with higher likelihoods of originating from the pre-training data, based on these membership probabilities. Our empirical findings indicate a remarkable outcome: LLMs with over 1B parameters exhibit a four- to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
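The pseudo-labeling step described above can be illustrated with a minimal sketch of a DetectGPT-style score used as a reward signal. This is not the paper's implementation: the `log_prob` and `perturb` helpers below are hypothetical stand-ins for scoring text with the target LLM and for the span-rewriting perturbations DetectGPT actually uses.

```python
import random
import statistics

def log_prob(text):
    # Hypothetical stand-in for the target LLM's average token
    # log-likelihood of `text`; a real attack would query the model.
    words = text.split()
    return -len(set(words)) / max(len(words), 1)

def perturb(text, rng):
    # Hypothetical stand-in for DetectGPT's mask-and-rewrite
    # perturbations; here we simply drop one random word.
    words = text.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def membership_reward(text, n_perturbations=10, seed=0):
    """DetectGPT-style curvature score: log p(x) minus the mean
    log-likelihood of nearby perturbed texts. Higher scores suggest
    the text is more likely machine-generated by the target model,
    which the attack uses as a proxy for pre-training membership."""
    rng = random.Random(seed)
    base = log_prob(text)
    perturbed = [log_prob(perturb(text, rng))
                 for _ in range(n_perturbations)]
    return base - statistics.mean(perturbed)
```

In the attack, a score like this would serve as the reward in RLHF fine-tuning, steering the model toward generations with higher estimated membership probability.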
External IDs: doi:10.1109/tifs.2025.3613882