Extreme Confidence and the Illusion of Robustness in Adversarial Training

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: Heavily miscalibrated models display an illusion of robustness
Abstract: Deep learning-based Natural Language Processing (NLP) models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness, and despite the challenging nature of textual inputs, numerous AT approaches have emerged for NLP models. However, we have discovered an intriguing phenomenon: miscalibrating a model, whether deliberately or accidentally (implicitly, as part of existing AT schemes), so that it is extremely overconfident or underconfident in its predictions disrupts adversarial attack search methods and gives rise to an apparent increase in robustness. We demonstrate that this observed gain is an illusion of robustness (IOR): an adversary aware of the miscalibration can apply temperature calibration to the predicted model logits, allowing the attack search method to find adversarial examples and thereby dispelling the IOR. Consequently, we urge adversarial robustness researchers to incorporate adversarial temperature scaling into their evaluations to mitigate IOR.
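For context on the temperature calibration step the abstract refers to, below is a minimal sketch (not the paper's implementation) of how an adversary might rescale a victim model's logits before running an attack search; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def temperature_scale(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Rescale logits by a temperature before the softmax.

    T > 1 softens the distribution (reduces confidence); T < 1 sharpens it.
    """
    return logits / temperature

# Illustrative usage: an extremely overconfident model yields a nearly flat
# search signal (probabilities saturate at 0/1), which can stall score-based
# attack search methods. Rescaling the logits restores a usable margin.
logits = torch.tensor([[40.0, -35.0]])                     # extreme, miscalibrated logits
probs_raw = F.softmax(logits, dim=-1)                      # ~[1.0, 0.0]: saturated signal
probs_scaled = F.softmax(temperature_scale(logits, 20.0), dim=-1)
print(probs_raw, probs_scaled)                             # scaled probabilities expose the margin
```

The design point is that temperature scaling is monotonic, so it leaves the model's predicted labels (and hence its clean accuracy) unchanged while restoring the confidence gradations that attack search methods rely on.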
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English