Learning to Generate Inversion-Resistant Model Explanations

Hoyong Jeong; Suyoung Lee; Sung Ju Hwang; Sooel Son

Published: 31 Oct 2022; Last Modified: 12 Jan 2023; Venue: NeurIPS 2022 (Accept); Readers: Everyone

Keywords: model inversion defense, model explanation, explainable AI

Abstract: The wide adoption of deep neural networks (DNNs) in mission-critical applications has spurred the need for interpretable models that explain their decisions. Unfortunately, previous studies have demonstrated that model explanations leak information, rendering DNN models vulnerable to model inversion attacks. These attacks enable an adversary to reconstruct original images from model explanations, thus leaking privacy-sensitive features. To address this threat, we present Generative Noise Injector for Model Explanations (GNIME), a novel defense framework that perturbs model explanations to minimize the risk of model inversion attacks while preserving the interpretability of the generated explanations. Specifically, we formulate defense training as a two-player minimax game between an inversion attack network, which aims to invert model explanations, and a noise generator network, which aims to inject perturbations that thwart model inversion attacks. We demonstrate that GNIME significantly decreases the information leakage in model explanations, reducing transferable classification accuracy in facial recognition models by up to 84.8% while preserving the original functionality of model explanations.
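
The two-player minimax game described in the abstract can be written down schematically. The formulation below is only an illustrative sketch, not the paper's actual objective: the symbols $x$ (input image), $e$ (the unperturbed explanation of the model's decision on $x$), $G_\theta$ (noise generator), $A_\phi$ (inversion attack network), the choice of norms, and the trade-off weight $\lambda$ are all assumptions introduced here for illustration.

```latex
% Illustrative sketch only; symbols, norms, and \lambda are assumptions.
% \tilde{e} is the perturbed explanation released to the user.
\begin{align}
  \tilde{e} &= e + G_\theta(e), \\
  \min_{\theta}\,\max_{\phi}\;
  &\mathbb{E}_{x}\Big[
      -\big\lVert A_\phi(\tilde{e}) - x \big\rVert_1
      + \lambda\,\big\lVert G_\theta(e) \big\rVert_2^2
  \Big].
\end{align}
```

Under this reading, the inner maximization trains the attacker $A_\phi$ to reduce its reconstruction error on perturbed explanations, while the outer minimization trains the generator $G_\theta$ to increase that error. The $\lambda$ term penalizes large perturbations and stands in for the requirement that explanations keep their original functionality.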

TL;DR: We propose the first defense framework that mitigates explanation-aware model inversion attacks by teaching a model to suppress inversion-critical features in a given explanation while preserving its functionality.

Supplementary Material: zip
