Detection of Textual Adversarial Attacks: A Benchmark via Out-of-Distribution Example Identification
Abstract: Word-level adversarial attacks have proven successful against NLP models, drastically decreasing the performance of transformer-based architectures. Still, relatively little effort has been devoted to detecting adversarial examples, even though such detection may be crucial for automated NLP tasks, especially for critical applications and model robustness. Pre-trained Transformers achieve high accuracy on in-distribution examples and, as recent work has shown, also generalize better to out-of-distribution samples than previous models. In this work, we aim to detect adversarial attacks in Natural Language Processing using Out-Of-Distribution (OOD) detection methods: Maximum Softmax Probability, the DOCTOR detector, and the Mahalanobis distance-based score, applied to pre-trained Transformers such as BERT and RoBERTa. In the benchmark we provide, we generate 2 types of attacks on 4 datasets and evaluate detection performance with the AUROC and AUPR metrics. Our experimental results show that applying these simple out-of-distribution detection scores can yield acceptable performance for adversarial attack detection. We provide code for our work on \href{https://github.com/Tedonze/Adversarial_Attacks_NLP}{GitHub}.
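As a minimal sketch of the simplest of the scores named above, Maximum Softmax Probability, the snippet below computes an MSP score with a fine-tuned Transformer classifier from Hugging Face Transformers and flags low-confidence inputs as potentially adversarial/OOD. The model name, example sentences, and threshold logic are illustrative assumptions, not taken from the paper's benchmark.

```python
# Sketch: Maximum Softmax Probability (MSP) as an adversarial/OOD detection score.
# Assumption: any fine-tuned sequence classifier works; the model name is illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "textattack/bert-base-uncased-imdb"  # hypothetical choice of fine-tuned BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def msp_score(text: str) -> float:
    """Return the maximum softmax probability; lower values suggest OOD/adversarial input."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    return probs.max().item()

# Inputs whose MSP falls below a threshold tuned on clean validation data get flagged.
clean = "A genuinely moving film with terrific performances."
perturbed = "A genuinly movnig fiml with terriffic performences."  # illustrative perturbation
for text in (clean, perturbed):
    print(f"MSP = {msp_score(text):.3f}  |  {text}")
```

In practice, detection quality is then summarized threshold-free with AUROC/AUPR by scoring a pool of clean and attacked examples, as done in the benchmark.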