Training Universal Adversarial Perturbations with Alternating Loss Functions

• Abstract: Despite their great success, deep learning models have been shown to be vulnerable to carefully crafted perturbations. Moreover, it has been shown that a single universal adversarial perturbation (UAP) can be learned that changes the prediction of a network on almost any image. In this work, we propose three different ways of training UAPs that attain a predefined fooling rate while jointly minimizing the $L_2$ or $L_\infty$ norm. To stabilize around a predefined fooling rate, we integrate an alternating loss function scheme that switches the current loss function based on a given condition. In particular, the loss functions we propose are Batch Alternating Loss, Epoch-Batch Alternating Loss, and Progressive Alternating Loss. In addition, we empirically observe that UAPs learned by minimization attacks contain strong image-like features around the edges; hence, we propose integrating a circular masking operation into training to further suppress visible perturbations. The proposed $L_2$ Progressive Alternating Loss method outperforms popular attacks by achieving a higher fooling rate at equal $L_2$ norms. Furthermore, the Filtered Progressive Alternating Loss can further reduce the $L_2$ norm by 33.3% at the same fooling rate. When optimized with respect to $L_\infty$, Progressive Alternating Loss stabilizes at the desired fooling rate of 95% within 1 percentage point of deviation, even though the $L_\infty$ norm is particularly sensitive to small updates.
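The core alternating idea sketched in the abstract — optimize the fooling objective until the target fooling rate is reached, then switch to shrinking the perturbation norm — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation; the names `select_loss` and `TARGET_FOOLING_RATE` are hypothetical, and the actual methods differ in when the condition is checked (per batch, per epoch, or progressively).

```python
# Hypothetical sketch of the alternating-loss condition described in the
# abstract: while the measured fooling rate is below the target, optimize
# the fooling objective; once the target is met, switch to minimizing the
# L2 / L-inf norm of the UAP. All names here are illustrative.

TARGET_FOOLING_RATE = 0.95  # the predefined fooling rate from the abstract

def select_loss(fooling_rate: float) -> str:
    """Return which objective to optimize on the next batch/epoch."""
    if fooling_rate < TARGET_FOOLING_RATE:
        return "fooling"  # push predictions away from the correct labels
    return "norm"         # shrink the perturbation's norm instead

# As training oscillates around the target, the scheme alternates between
# the two objectives, stabilizing the fooling rate near the target value.
schedule = [select_loss(fr) for fr in (0.50, 0.96, 0.94, 0.97)]
```

The per-batch variant would evaluate this condition after every batch, while the epoch-based variants would check it less frequently; the circular masking mentioned in the abstract would be applied as an elementwise mask on the UAP before it is added to the input.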