PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks

Published: 27 Oct 2023, Last Modified: 12 Dec 2023, RegML 2023
Keywords: Adversarial Risk Certification; AI Safety
TL;DR: We develop a framework to certify the population adversarial risk of machine learning models.
Abstract: It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations, so it is increasingly important to develop the capability to certify their performance in the presence of the most effective adversarial attacks. Our paper offers a new approach to certify the performance of machine learning models in the presence of adversarial attacks, with population-level risk guarantees. In particular, given a specific attack, we introduce the notion of an $(\alpha,\zeta)$ machine learning model safety guarantee: this guarantee, which is supported by a testing procedure based on the availability of a calibration set, ensures that one declares a machine learning model's adversarial (population) risk to be less than $\alpha$ (i.e., the model is safe) when that risk is in fact higher than $\alpha$ (i.e., the model is actually unsafe) with probability less than $\zeta$. We also propose Bayesian optimization algorithms, along with their statistical guarantees, to determine very efficiently whether or not a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack. We apply our framework to a range of machine learning models, including various sizes of vision Transformer (ViT) and ResNet models, impaired by a variety of adversarial attacks such as AutoAttack, SquareAttack, and the natural evolution strategy attack, in order to illustrate the merit of our approach. Of particular relevance, we show that ViTs are generally more robust to adversarial attacks than ResNets, and ViT-large is more robust than smaller models. Overall, our approach goes beyond existing certification guarantees based on empirical adversarial risk, paving the way to more effective AI regulation based on rigorous (and provable) performance guarantees.
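A minimal formalization of the $(\alpha,\zeta)$ safety guarantee described above, written in our own (assumed) notation rather than the paper's: let $R_{\mathrm{adv}}(f)$ denote the adversarial population risk of model $f$ under a fixed attack, and let $\hat{T}(f) \in \{\text{safe}, \text{unsafe}\}$ be the decision returned by the calibration-set testing procedure. The guarantee then reads
$$\mathbb{P}\big[\hat{T}(f) = \text{safe} \;\big|\; R_{\mathrm{adv}}(f) > \alpha\big] < \zeta,$$
i.e., the probability of wrongly certifying as safe a model whose adversarial risk in fact exceeds $\alpha$ is controlled at level $\zeta$.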
Submission Number: 3