Calibration Attacks: A Comprehensive Study of Adversarial Attacks on Model Confidence

TMLR Paper2705 Authors

16 May 2024 (modified: 21 May 2024). Under review for TMLR. License: CC BY-SA 4.0
Abstract: In this work, we highlight and comprehensively study calibration attacks, a form of adversarial attack that aims to make victim models heavily miscalibrated without altering their predicted labels, thereby endangering the trustworthiness of the models and any downstream decision making based on their confidence. We propose four typical forms of calibration attacks: underconfidence, overconfidence, maximum-miscalibration, and random-confidence attacks, conducted in both black-box and white-box setups. We demonstrate that the attacks are highly effective against both convolutional and attention-based models: with a small number of queries, they severely skew confidence without changing predictive performance. Given this potential danger, we further investigate the effectiveness of a wide range of adversarial defence and recalibration methods, including defences we design specifically for calibration attacks, to mitigate the harm. Judging by ECE and KS scores, existing methods still have significant limitations in handling calibration attacks. To the best of our knowledge, this is the first dedicated study to provide a comprehensive investigation of calibration-focused attacks. We hope it draws more attention to these attacks and thereby helps prevent the serious damage they could cause. To this end, the work also provides detailed analyses to understand the characteristics of the attacks.
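The abstract evaluates miscalibration with ECE and KS scores. As a point of reference only, below is a minimal sketch of the standard equal-width binned Expected Calibration Error; the bin count, binning scheme, and function name are assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| per bin.

    confidences: array of predicted top-class probabilities in [0, 1]
    correct:     boolean/0-1 array, whether each prediction was correct
    n_bins:      number of equal-width confidence bins (assumed value)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of samples falling into the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```

A calibration attack in the sense described above would drive this quantity up (e.g. by lowering confidence on correct predictions or raising it on incorrect ones) while leaving the argmax labels, and hence accuracy, unchanged.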
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Aditya_Menon1
Submission Number: 2705