Keywords: adversarial examples, certified defenses, degradation attacks
Abstract: Certifiably robust neural networks employ provable run-time defenses against adversarial examples by checking whether the model is locally robust at the input under evaluation. We show through examples and experiments that these defenses are inherently over-cautious: they flag inputs for which the local robustness check fails but which are nonetheless not adversarial, i.e., inputs classified consistently with all valid inputs within a distance of $\epsilon$. As a result, although a norm-bounded adversary cannot change the classification of an input, it can use norm-bounded changes to degrade the utility of certifiably robust networks by forcing them to reject otherwise correctly classifiable inputs. We empirically demonstrate the efficacy of such attacks against state-of-the-art certifiable defenses.
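A sketch of the distinction the abstract draws, in notation we introduce here rather than taken from the paper: let $f$ be the classifier, $B_\epsilon(x)$ the $\epsilon$-ball around an input $x$, and $V$ the set of valid inputs. A certified run-time defense accepts $x$ only if the local robustness condition $\forall x' \in B_\epsilon(x).\ f(x') = f(x)$ can be verified, whereas $x$ is non-adversarial under the weaker condition $\forall x' \in B_\epsilon(x) \cap V.\ f(x') = f(x)$. Inputs satisfying the second condition but not the first are precisely those that a degradation attack can force the defense to reject.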
One-sentence Summary: Certifiably robust neural networks are too conservative, making them vulnerable to degradation attacks