Abstract: Deep neural networks have been shown to be vulnerable to adversarial attacks. In the black-box setting, where no internal information about the target is available, surrogate-based black-box attacks train a surrogate model on samples queried from the target so that it imitates the black box's behavior; the trained surrogate is then attacked to generate adversarial examples. Existing surrogate-based attacks suffer from low success rates because they fail to accurately capture the target's behavior: their surrogates only mimic the target's outputs for a given set of inputs. Moreover, their attack strategies rely on noisy estimates of high-dimensional gradients with respect to the inputs (i.e., the surrogate's gradients) to generate adversarial examples. Ideally, a successful surrogate-based attack should possess two properties: (1) train and employ a surrogate that accurately imitates the target's behavior for every input-output pair, i.e., the target's joint distribution over its inputs and outputs; and (2) generate adversarial examples by directly manipulating the class-dependent factors of the input, i.e., the factors that affect the target's output, rather than relying on noisy gradient estimates. We propose a novel surrogate-based attack framework with a surrogate architecture that learns the target's distribution over its inputs and outputs while disentangling the class-dependent factors from class-irrelevant ones. The framework is equipped with a novel attack strategy that fully exploits the target distribution captured by the surrogate and generates adversarial examples by directly manipulating the class-dependent factors. Extensive experiments demonstrate the efficacy of our attack in generating highly successful adversarial examples compared to state-of-the-art methods.
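To make the baseline pipeline the abstract critiques concrete, here is a minimal sketch of a generic surrogate-based black-box attack: query the target for labels, fit a surrogate on the (input, output) pairs, then take an FGSM-style step along the surrogate's input gradient. This is an illustrative toy (the black box is a hypothetical linear classifier, the surrogate is softmax regression), not the paper's proposed method, which instead models the joint input-output distribution and perturbs disentangled class-dependent factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black box: a linear classifier whose weights the attacker never sees.
W_target = rng.normal(size=(2, 4))

def black_box(x):
    # Query access only: the target returns hard labels, nothing else.
    return np.argmax(x @ W_target.T, axis=-1)

# Step 1: train a surrogate on samples queried from the target.
X = rng.normal(size=(500, 4))
y = black_box(X)
W = np.zeros((2, 4))                       # surrogate: softmax regression
for _ in range(200):                       # plain gradient descent on cross-entropy
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * (p - np.eye(2)[y]).T @ X / len(X)

# Step 2: attack the surrogate, not the target, with an FGSM-style step,
# hoping the adversarial example transfers to the black box.
x = rng.normal(size=(1, 4))
label = black_box(x)[0]
logits = x @ W.T
p = np.exp(logits - logits.max())
p /= p.sum()
grad = (p - np.eye(2)[label]) @ W          # d(cross-entropy)/dx under the surrogate
x_adv = x + 0.5 * np.sign(grad)            # epsilon-bounded perturbation of the input
```

The surrogate's gradient is only a proxy for the target's, which is precisely the noisy-estimation weakness the abstract identifies: the perturbation succeeds only insofar as the surrogate's decision boundary matches the target's near `x`.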