Turning a Curse into a Blessing: Enabling In-Distribution-Data-Free Backdoor Removal via Stabilized Model Inversion

Si Chen; Yi Zeng; Won Park; Jiachen T. Wang; Xun Chen; Lingjuan Lyu; Zhuoqing Mao; Ruoxi Jia

Published: 21 Sept 2023, Last Modified: 17 Sept 2024, Accepted by TMLR, CC BY 4.0

Abstract: The effectiveness of many existing techniques for removing backdoors from machine learning models relies on access to clean in-distribution data. However, given that these models are often trained on proprietary datasets, it may not be practical to assume that in-distribution samples will always be available. On the other hand, model inversion techniques, which are typically viewed as privacy threats, can reconstruct realistic training samples from a given model, potentially eliminating the need for in-distribution data. To date, the only prior attempt to integrate backdoor removal and model inversion involves a simple combination that produced very limited results. This work represents a first step toward a more thorough understanding of how model inversion techniques could be leveraged for effective backdoor removal. Specifically, we seek to answer several key questions: What properties must reconstructed samples possess to enable successful defense? Is perceptual similarity to clean samples enough, or are additional characteristics necessary? Is it possible for reconstructed samples to contain backdoor triggers? We demonstrate that relying solely on perceptual similarity is insufficient for effective defenses. The stability of model predictions in response to input and parameter perturbations also plays a critical role. To address this, we propose a new bi-level optimization based framework for model inversion that promotes stability in addition to visual quality. Interestingly, we also find that reconstructed samples from a pre-trained generator's latent space do not contain backdoors, even when signals from a backdoored model are utilized for reconstruction. We provide a theoretical analysis to explain this observation. Our evaluation shows that our stabilized model inversion technique achieves state-of-the-art backdoor removal performance without requiring access to clean in-distribution data. Furthermore, its performance is on par with or even better than using the same amount of clean samples.

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: Camera-ready version. Changes include:
- Removed red annotations and integrated the added changes into the text.
- Moved 'adaptive attack' to the appendix (Section F) and updated the outline at the beginning of the Evaluation section.
- Adjusted Figure 8.
- Adjusted equations and tables (position, size, spacing).
- Fixed typos.
- Added authors and other information.

Code: https://github.com/SCccc21/FRED.git

Supplementary Material: zip

Assigned Action Editor: ~Sebastian_U_Stich1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 991
