AlignFix: Fixing Adversarial Perturbations by Agreement Checking for Adversarial Robustness against Black-box Attacks
Abstract: Motivated by the vulnerability of feed-forward visual pathways to adversarial-like inputs and the overall robustness of biological perception, commonly attributed to top-down feedback processes, we propose a new defense method, AlignFix. We exploit the fact that naturally and adversarially trained models rely on distinct feature sets for classification. Notably, naturally trained models, referred to as \textit{weakM}, retain commendable accuracy against adversarial examples generated using adversarially trained models, referred to as \textit{strongM}, and vice versa. Further, the two models tend to agree more in their predictions when the input is nudged toward the correct class. Leveraging this, AlignFix first perturbs the input toward the class predicted by the naturally trained model, using a joint loss from both \textit{weakM} and \textit{strongM}. If this retains or leads to agreement, the prediction is accepted; otherwise, the original \textit{strongM} output is used. This mechanism is highly effective against leading score-based query attacks (SQA) as well as decision-based and transfer-based black-box attacks. We demonstrate its effectiveness through comprehensive experiments across various datasets (CIFAR and ImageNet) and architectures (ResNet and ViT).
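The abstract describes an agreement-checking decision rule; the following is a minimal sketch of that rule, assuming PyTorch models, an L-infinity-bounded sign-gradient nudge, and illustrative hyperparameters (eps, step_size, steps) that are placeholders rather than values from the paper.

import torch
import torch.nn.functional as F

def alignfix_predict(x, weak_m, strong_m, eps=8/255, step_size=2/255, steps=5):
    """Sketch of the AlignFix decision rule: nudge the input toward the
    class predicted by the naturally trained model (weak_m) with a joint
    loss from both models, then keep the prediction only if the two
    models agree; otherwise fall back to strong_m's original output."""
    weak_m.eval()
    strong_m.eval()
    with torch.no_grad():
        strong_pred = strong_m(x).argmax(dim=1)  # fallback prediction
        target = weak_m(x).argmax(dim=1)         # class to nudge toward

    # Targeted perturbation toward weak_m's predicted class using a joint loss.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(weak_m(x_adv), target) + \
               F.cross_entropy(strong_m(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - step_size * grad.sign()            # descend toward target class
            x_adv = x + (x_adv - x).clamp(-eps, eps)           # stay within the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # keep a valid image
        x_adv = x_adv.detach()

    with torch.no_grad():
        weak_after = weak_m(x_adv).argmax(dim=1)
        strong_after = strong_m(x_adv).argmax(dim=1)

    # Accept the agreed-upon prediction; otherwise use strong_m's original output.
    agree = weak_after == strong_after
    return torch.where(agree, weak_after, strong_pred)

The authors' released code at the linked repository is authoritative; this sketch only illustrates the accept-or-fallback logic stated in the abstract.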
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/aknirala/AlignFix
Supplementary Material: zip
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 5028