Keywords: Adversarial Risk Certification; AI Safety
TL;DR: We develop a framework to produce a certification of population adversarial risks for machine learning models.
Abstract: It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly important to be able to certify their performance in the presence of the most effective adversarial attacks. Our paper offers a new approach to certifying the performance of machine learning models under adversarial attacks, with population-level risk guarantees. In particular, given a specific attack, we introduce the notion of an $(\alpha,\zeta)$ machine learning model safety guarantee. This guarantee, which is supported by a testing procedure based on the availability of a calibration set, entails that the probability of declaring the model's adversarial (population) risk to be less than $\alpha$ (i.e. the model is safe), given that the adversarial (population) risk is in fact higher than $\alpha$ (i.e. the model is unsafe), is less than $\zeta$. We also propose Bayesian optimization algorithms that efficiently determine whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack, along with their statistical guarantees. We apply our framework to a range of machine learning models, including various sizes of vision Transformer (ViT) and ResNet models, impaired by a variety of adversarial attacks such as AutoAttack, SquareAttack and natural evolution strategy attack, in order to illustrate the merit of our approach. Of particular relevance, we show that ViTs are generally more robust to adversarial attacks than ResNets, and that ViT-large is more robust than smaller models. Overall, our approach goes beyond existing certification guarantees based on empirical adversarial risk, paving the way to more effective AI regulation based on rigorous (and provable) performance guarantees.
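To make the $(\alpha,\zeta)$ guarantee concrete, the following is a minimal sketch of one simple way such a test could be built from a calibration set. It is not the Bayesian optimization procedure proposed in the paper. It assumes the calibration losses are i.i.d. 0/1 indicators of whether the attacked model misclassifies each calibration point, and it uses a one-sided Hoeffding bound; the function names `hoeffding_upper_bound` and `declare_safe` are hypothetical.

```python
import math


def hoeffding_upper_bound(losses, zeta):
    """Upper confidence bound on the population adversarial risk.

    Assumption (not from the paper): `losses` are i.i.d. values in [0, 1],
    e.g. 1 if the attack fools the model on a calibration point, else 0.
    By Hoeffding's inequality, the true risk exceeds the returned value
    with probability at most `zeta`.
    """
    n = len(losses)
    if n == 0:
        raise ValueError("calibration set must not be empty")
    if not 0.0 < zeta < 1.0:
        raise ValueError("zeta must lie strictly between 0 and 1")
    empirical_risk = sum(losses) / n
    margin = math.sqrt(math.log(1.0 / zeta) / (2.0 * n))
    return empirical_risk + margin


def declare_safe(losses, alpha, zeta):
    """Declare the model safe only if the upper bound falls below alpha.

    If the true adversarial risk is at least alpha, this returns True
    with probability at most zeta (under the i.i.d. assumption above).
    """
    return hoeffding_upper_bound(losses, zeta) < alpha


# Illustrative (made-up) calibration outcome: 50 failures out of 1000 points.
calibration_losses = [1] * 50 + [0] * 950
print(declare_safe(calibration_losses, alpha=0.10, zeta=0.05))
print(declare_safe(calibration_losses, alpha=0.08, zeta=0.05))
```

With these made-up numbers the empirical risk is 0.05 and the Hoeffding margin is roughly 0.039, so the model is declared safe at $\alpha = 0.10$ but not at $\alpha = 0.08$. The bound is conservative; a tighter test may certify more models at the same $\zeta$.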
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 168