Keywords: adversarial examples, certified defenses, degradation attacks
Abstract: Certifiably robust neural networks employ provable run-time defenses against adversarial examples by checking whether the model is locally robust at the input under evaluation. We show through examples and experiments that these defenses are inherently over-cautious: they flag inputs for which the local robustness check fails but which are nonetheless not adversarial, i.e., inputs classified consistently with all valid inputs within a distance of $\epsilon$. As a result, although a norm-bounded adversary cannot change the classification of an input, it can use norm-bounded changes to degrade the utility of certifiably robust networks by forcing them to reject otherwise correctly classifiable inputs. We empirically demonstrate the efficacy of such attacks against state-of-the-art certifiable defenses.
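A sketch of the distinction the abstract draws, in notation we introduce here rather than taken from the paper: let $f$ be the classifier, $B_\epsilon(x)$ the $\epsilon$-ball around an input $x$, and $V$ the set of valid inputs. A certified run-time defense accepts $x$ only if the local robustness condition $\forall x' \in B_\epsilon(x).\ f(x') = f(x)$ can be verified, whereas $x$ is non-adversarial under the weaker condition $\forall x' \in B_\epsilon(x) \cap V.\ f(x') = f(x)$. Inputs satisfying the second condition but not the first are precisely those that a degradation attack can force the defense to reject.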
One-sentence Summary: Certifiably robust neural networks are too conservative, making them vulnerable to degradation attacks