Abstract: Deep neural networks (DNNs) are vulnerable to
adversarial noise. Their adversarial robustness
can be improved by exploiting adversarial examples. However, given the continuously evolving nature of attacks, models trained on seen types of adversarial examples generally do not generalize well to unseen types. To address this problem, in this paper we propose to remove adversarial noise by learning attack-invariant features that preserve semantic classification information and generalize across attacks. Specifically, we introduce an adversarial feature learning mechanism to disentangle the invariant features from adversarial noise, and we further propose a normalization term in the encoded space of the attack-invariant features to address the bias between seen and unseen types of attacks. Empirical evaluations demonstrate that our method provides better protection than previous state-of-the-art approaches, especially against unseen types of attacks and adaptive attacks.
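
For intuition, below is a minimal, illustrative PyTorch-style sketch of the kind of training objective the abstract describes. The module and function names (DisentanglingEncoder, encoder_losses), network sizes, and the specific form of the normalization penalty are assumptions made for illustration only; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentanglingEncoder(nn.Module):
    """Splits an input into an attack-invariant code and a noise-related code."""
    def __init__(self, in_dim=784, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.inv_head = nn.Linear(256, feat_dim)    # attack-invariant code z_inv
        self.noise_head = nn.Linear(256, feat_dim)  # noise-related code z_adv

    def forward(self, x):
        h = self.backbone(x.flatten(1))
        return self.inv_head(h), self.noise_head(h)

encoder = DisentanglingEncoder()
classifier = nn.Linear(128, 10)    # predicts the class label from z_inv
discriminator = nn.Linear(128, 1)  # guesses whether z_inv came from a clean or adversarial input

def encoder_losses(x_clean, x_adv, y):
    z_inv_c, _ = encoder(x_clean)
    z_inv_a, _ = encoder(x_adv)

    # 1) z_inv must retain semantic classification information on both
    #    clean and adversarial inputs.
    cls_loss = F.cross_entropy(classifier(z_inv_c), y) + \
               F.cross_entropy(classifier(z_inv_a), y)

    # 2) Adversarial feature learning: the encoder is rewarded when the
    #    discriminator cannot tell clean codes from adversarial ones
    #    (the discriminator itself would be updated separately with the
    #    opposite objective, omitted here for brevity).
    logits = discriminator(torch.cat([z_inv_c, z_inv_a]))
    labels = torch.cat([torch.ones(len(x_clean), 1), torch.zeros(len(x_adv), 1)])
    adv_feat_loss = -F.binary_cross_entropy_with_logits(logits, labels)

    # 3) Normalization term on the encoded invariant features: penalize the
    #    statistics gap between codes of clean and attacked inputs so that
    #    features of unseen attacks are not biased toward seen-attack
    #    statistics (sketched here as a simple first-moment match).
    norm_loss = (z_inv_a.mean(0) - z_inv_c.mean(0)).pow(2).mean()

    return cls_loss + adv_feat_loss + norm_loss

# Usage sketch: x_clean and x_adv are (batch, 784) tensors, y is a (batch,)
# tensor of class labels.
#   loss = encoder_losses(x_clean, x_adv, y); loss.backward()
```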