Early Stop and Adversarial Training Yield Better Surrogate Models: Very Non-Robust Features Harm Adversarial Transferability
Keywords: Adversarial Transferability, Early Stop, Adversarial Training, Non-Robust Features
Abstract: The transferability of adversarial examples (AEs), known as adversarial transferability, has attracted significant attention because it can be exploited for Transferable Black-box Attacks (TBA). Most prior works attribute adversarial transferability to the existence of non-robust features. As a motivating example, we test adversarial transferability on early-stopped surrogate models, which prior works have shown to rely more on robust features than on non-robust features. We find that early-stopped models yield better adversarial transferability than models at the final epoch, which is counter-intuitive from the perspective of robust and non-robust features (NRFs). In this work, we articulate a novel Very Non-Robust Feature (VNRF) hypothesis, which states that the learned VNRFs can harm adversarial transferability, to explain this phenomenon. This hypothesis is partly verified by zeroing out filters with high l1-norm values. This insight further motivates us to adopt light adversarial training, which mainly removes the VNRFs, to significantly improve transferability.
One-sentence Summary: We propose early stopping and adversarial training to improve adversarial transferability, and introduce the Very Non-Robust Feature hypothesis as an explanation.
Supplementary Material: zip
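The filter-zeroing check mentioned in the abstract can be sketched as follows. This is a minimal, hypothetical PyTorch example (the function name `zero_high_l1_filters`, the `ratio` parameter, and the ResNet-50 surrogate are assumptions, not the authors' implementation): it ablates the convolutional filters with the largest l1 norms in a surrogate model before adversarial examples are crafted.

```python
import torch
import torch.nn as nn

def zero_high_l1_filters(conv: nn.Conv2d, ratio: float = 0.1) -> None:
    """Zero out (in place) the fraction of conv filters with the largest l1 norms.

    Hypothetical sketch of the ablation described in the abstract: filters with
    very large l1 norms are treated as candidate "very non-robust features" and
    removed from the surrogate model.
    """
    with torch.no_grad():
        # Per-output-channel l1 norm of the filter weights.
        l1 = conv.weight.abs().flatten(1).sum(dim=1)
        k = max(1, int(ratio * l1.numel()))
        top = torch.topk(l1, k).indices
        conv.weight[top] = 0.0
        if conv.bias is not None:
            conv.bias[top] = 0.0

# Example usage (assumed setup): ablate 10% of the filters in every conv layer
# of a surrogate model, then craft adversarial examples on the modified model.
# model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
# for m in model.modules():
#     if isinstance(m, nn.Conv2d):
#         zero_high_l1_filters(m, ratio=0.1)
```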