A Data-Centric Safety Framework for Generative Models: Adversarial Fingerprint Detection and Attribution
Keywords: Contamination Detection, Memorization Attribution, Token-level Fingerprints, Contrastive Learning, Post-hoc Data Removal, Model Forensics, Trustworthy AI, Data-centric Safety
Abstract: Generative models have revolutionized applications from text synthesis to image creation, yet their safety and trustworthiness are undermined by unintended memorization and data contamination. Existing detection methods—relying on output similarity or final-layer embeddings—either lack instance-level precision or fail to provide actionable attributions. To address these limitations, we propose \textbf{FPGuard}, a Data-Centric Safety Framework that performs Adversarial Fingerprint Detection and Attribution. FPGuard extracts token-level fingerprints from intermediate hidden states, constructs a scalable fingerprint bank from training data, and employs contrastive learning to enhance discriminability. At test time, FPGuard computes a contamination score by aggregating top-$k$ cosine similarities between test and banked fingerprints, and generates fine-grained attribution maps that identify the exact training instances responsible. Moreover, FPGuard enables post-hoc detoxification through targeted data removal, significantly reducing contamination effects. Experiments on LLaMA-2-7B and GPT-J under synthetic (SQuAD$\rightarrow$Pile) and natural (RedPajama$\rightarrow$TriviaQA) contamination settings show that FPGuard improves detection \textbf{Precision@10} by up to 25\%, enhances attribution precision by 30--45\%, and lowers contamination scores by up to 43\% compared to prior baselines, all without retraining.
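To make the scoring step concrete, the sketch below illustrates how a contamination score of the kind described in the abstract could be computed: aggregate the top-$k$ cosine similarities between test-time token fingerprints and a bank of training-data fingerprints. This is a minimal illustration, not the authors' released implementation; the function name, the mean-over-tokens aggregation, and the use of NumPy are assumptions.

```python
# Minimal sketch (not the authors' code) of a top-k cosine-similarity
# contamination score over token-level fingerprints.
import numpy as np

def contamination_score(test_fps: np.ndarray,
                        fingerprint_bank: np.ndarray,
                        top_k: int = 10) -> float:
    """test_fps: (T, d) token-level fingerprints from intermediate hidden states.
    fingerprint_bank: (N, d) fingerprints extracted from training instances.
    Returns a scalar; higher values suggest stronger contamination/memorization."""
    # L2-normalize rows so dot products equal cosine similarities.
    test = test_fps / (np.linalg.norm(test_fps, axis=1, keepdims=True) + 1e-8)
    bank = fingerprint_bank / (np.linalg.norm(fingerprint_bank, axis=1, keepdims=True) + 1e-8)
    sims = test @ bank.T                      # (T, N) cosine similarities
    k = min(top_k, sims.shape[1])
    topk = np.sort(sims, axis=1)[:, -k:]      # keep each token's k closest banked fingerprints
    # Aggregate: mean top-k similarity, averaged over test tokens (illustrative choice).
    return float(topk.mean())

# Example usage with random fingerprints of dimension 4096 (LLaMA-2-7B hidden size).
rng = np.random.default_rng(0)
score = contamination_score(rng.normal(size=(32, 4096)),
                            rng.normal(size=(1000, 4096)))
print(f"contamination score: {score:.4f}")
```

The same per-token similarity matrix would also support the attribution maps mentioned in the abstract, since the indices of each token's top-$k$ banked fingerprints point back to candidate training instances.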
Submission Number: 56