Abstract: The distance between two classes for a deep learning classifier can be measured by the level of difficulty in flipping all (or the majority of) samples in one class to the other. The class distances of many pre-trained models in the wild are very small and do not align well with human intuition (e.g., the classes turtle and bird have a smaller distance than the classes cat and dog), making the models vulnerable to backdoor attacks, which aim to cause misclassification by stamping a specific pattern on inputs. We propose a novel model hardening technique called model orthogonalization, an add-on training step for pre-trained models, including clean models, poisoned models, and adversarially trained models. It can substantially enlarge class distances at reasonable training cost and without much accuracy degradation. Our evaluation on 5 datasets with 22 model structures shows that our technique enlarges class distances by 177.63% on average with less than 1% accuracy loss, outperforming existing hardening techniques such as adversarial training, universal adversarial perturbation, and directly using generated backdoors. It reduces false positives of a state-of-the-art backdoor scanner by 80%, as the enlarged class distances allow the scanner to easily distinguish clean and poisoned models, and it substantially outperforms three existing techniques in removing injected backdoors.
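The abstract defines class distance as the effort required to flip all (or most) samples of one class to another. The sketch below is an illustration of that definition rather than the paper's implementation: it optimizes a single universal perturbation (PyTorch) that pushes source-class inputs toward a target class and uses the perturbation's size as a distance proxy. The function name, loss weighting, and stopping criterion are assumptions.

```python
# Hypothetical sketch of the class-distance measurement described in the
# abstract: optimize one shared perturbation that flips source-class samples
# to a target class; a small perturbation implies the classes are close.
import torch
import torch.nn.functional as F

def measure_class_distance(model, source_loader, target_class,
                           steps=500, lr=0.05, flip_goal=0.9, device="cpu"):
    """Return (perturbation L1 mass, flip rate) for flipping the source class
    to `target_class`. Smaller mass at a high flip rate = smaller distance."""
    model.eval().to(device)
    # One shared additive perturbation, broadcast across all source samples.
    x0, _ = next(iter(source_loader))
    delta = torch.zeros_like(x0[:1], device=device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        flipped, total = 0, 0
        for x, _ in source_loader:
            x = x.to(device)
            logits = model(torch.clamp(x + delta, 0, 1))
            tgt = torch.full((x.size(0),), target_class,
                             dtype=torch.long, device=device)
            # Push predictions toward the target class while keeping the
            # perturbation small (L1 penalty, weight chosen arbitrarily here).
            loss = F.cross_entropy(logits, tgt) + 1e-3 * delta.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            flipped += (logits.argmax(1) == tgt).sum().item()
            total += x.size(0)
        if flipped / total >= flip_goal:
            break

    return delta.detach().abs().sum().item(), flipped / total
```

Under this reading, model orthogonalization would retrain the model so that such a perturbation must be much larger before it can flip a class, which is what "enlarging class distances" refers to in the abstract.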