The Role of Counterfactual Explanations in Model Extraction Attacks

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: model extraction, counterfactual explanations, decision boundary shift, polytope theory, query complexity
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose novel strategies for incorporating counterfactual explanations in model extraction attacks with theoretical guarantees.
Abstract: Counterfactuals provide guidance on how to achieve a favorable outcome from a model with minimal input perturbation. However, counterfactuals can also be exploited to leak information about the underlying model, raising privacy concerns. Prior work shows that one can query a model for counterfactuals at several input instances and train a surrogate model on the resulting query-counterfactual pairs. In this work, we analyze how model extraction attacks can be improved by further leveraging the fact that counterfactuals also lie close to the decision boundary. Using polytope theory, we derive a novel theoretical relationship between the error in model approximation and the number of queries, under the assumption that each query returns the exact "closest" counterfactual. Since counterfactual generation is rarely exact in practice, we also provide additional guarantees, leveraging Lipschitz continuity, that hold when the counterfactuals are reasonably close but not necessarily the closest ones. Our theoretical results lead to a simple strategy for model extraction, built around a loss function that treats counterfactuals differently from ordinary instances. Our approach also alleviates the related problem of "decision boundary shift", where naively training on counterfactuals as ordinary labeled points pushes the surrogate's boundary past the target's. Experimental results demonstrate the performance of our strategy on synthetic data as well as popular real-world tabular datasets.
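A minimal sketch of what such an extraction loss could look like, assuming a PyTorch binary-classification surrogate. The names (`Surrogate`, `extraction_loss`, `boundary_weight`) and the specific choice of driving counterfactual logits toward zero are illustrative assumptions, not the paper's exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Surrogate(nn.Module):
    """Small MLP surrogate; outputs a single logit for binary classification."""
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def extraction_loss(model: nn.Module,
                    x_query: torch.Tensor,   # queried input instances
                    y_query: torch.Tensor,   # victim model's labels in {0., 1.}
                    x_cf: torch.Tensor,      # counterfactuals returned per query
                    boundary_weight: float = 1.0) -> torch.Tensor:
    # Ordinary queried instances: fit the victim's hard labels.
    fit = F.binary_cross_entropy_with_logits(model(x_query), y_query)
    # Counterfactuals: since close counterfactuals hug the decision boundary,
    # pull the surrogate's logit toward 0 (probability 0.5) at these points
    # rather than fitting them as ordinary favorable-class samples, which
    # would otherwise push the learned boundary past the victim's
    # (hypothetical stand-in for "treating counterfactuals differently").
    boundary = model(x_cf).pow(2).mean()
    return fit + boundary_weight * boundary
```

A typical usage loop, again purely illustrative:

```python
model = Surrogate(dim=x_query.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = extraction_loss(model, x_query, y_query, x_cf)
    loss.backward()
    opt.step()
```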
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2919