On the Privacy Risks of Post-Hoc Explanations of Foundation Models

Published: 03 Jul 2024, Last Modified: 03 Jul 2024, ICML 2024 FM-Wild Workshop Poster, CC BY 4.0
Keywords: Machine Learning, Explainability, Interpretability, Post-Hoc Explanations, Privacy, Data Privacy, Foundation Models, Vision Transformers, Deep Learning, Trustworthy ML, Membership Inference Attacks, Adversarial ML
TL;DR: We reveal unforeseen vulnerabilities of post-hoc model explanations to membership inference by introducing two novel attacks; we show on vision transformers that these attacks are more successful than existing attacks that leverage explanations.
Abstract: Foundation models are increasingly deployed in high-stakes contexts in fields such as medicine, finance, and law. In these contexts, there is a trade-off between model explainability and data privacy: explainability promotes transparency, while privacy limits it. In this work, we push the boundaries of this trade-off: focusing on vision transformers fine-tuned for image classification, we reveal unforeseen privacy risks of post-hoc feature attribution explanations. We construct VAR-LRT and L1/L2-LRT, two novel membership inference attacks based on feature attribution explanations that are significantly more successful than existing attacks, particularly in the low false-positive rate regime that allows an adversary to identify specific training set members with high confidence. We carry out a rigorous empirical analysis with 2 novel attacks, 5 vision transformer architectures, 5 benchmark datasets, and 4 state-of-the-art post-hoc explanation methods. Our work addresses the lack of trust in post-hoc explanation methods that has contributed to the slow adoption of foundation models in high-stakes domains.
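As an illustrative aside (not the authors' released implementation), the sketch below shows one way an explanation-based, likelihood-ratio-style membership inference test could be organized, consistent with the attack names: reduce each feature attribution map to a scalar statistic (variance for VAR-LRT, an L1 or L2 norm for L1/L2-LRT), then score the target example with a Gaussian likelihood ratio over "member" vs. "non-member" shadow-model statistics. The arrays `in_stats` / `out_stats`, the `attribution_statistic` helper, and the Gaussian scoring are assumptions introduced here for illustration only.

```python
# Hypothetical sketch of an explanation-based LiRA-style membership test.
# Assumes attribution maps for the query point have already been computed
# from shadow models trained with ("in") and without ("out") that point.
import numpy as np
from scipy.stats import norm


def attribution_statistic(attr_map: np.ndarray, kind: str = "var") -> float:
    """Reduce one attribution map to a scalar test statistic."""
    flat = attr_map.ravel()
    if kind == "var":   # VAR-LRT-style statistic: variance of attribution values
        return float(flat.var())
    if kind == "l1":    # L1-LRT-style statistic: L1 norm of the attribution map
        return float(np.abs(flat).sum())
    if kind == "l2":    # L2-LRT-style statistic: L2 norm of the attribution map
        return float(np.sqrt((flat ** 2).sum()))
    raise ValueError(f"unknown statistic kind: {kind}")


def likelihood_ratio_score(target_stat: float,
                           in_stats: np.ndarray,
                           out_stats: np.ndarray) -> float:
    """Gaussian log-likelihood ratio; larger values suggest membership."""
    mu_in, sd_in = in_stats.mean(), in_stats.std() + 1e-12
    mu_out, sd_out = out_stats.mean(), out_stats.std() + 1e-12
    return float(norm.logpdf(target_stat, mu_in, sd_in)
                 - norm.logpdf(target_stat, mu_out, sd_out))
```

Thresholding such a score at a calibrated value is what would drive the low false-positive-rate evaluation described in the abstract; the specific statistics, attribution methods, and calibration used in the paper are detailed in the full text.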
Submission Number: 85