A Simple General Method for Detecting Textual Adversarial Examples

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Although deep neural networks have achieved state-of-the-art performance in various machine learning and artificial intelligence tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks in natural language tasks have boomed in the last five years, using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, like adversarial training, there are, to our knowledge, no effective reactive approaches to defence via detection of textual adversarial examples such as are found in the image processing literature. In this paper, we fill this gap by applying distance-based ensemble learning over semantic representations from different representation learning models, motivated by our understanding of why adversarial examples arise. Our technique, the MultiDistance Representation Ensemble Method (MDRE), obtains state-of-the-art results against character-level, word-level, and phrase-level attacks on the IMDB dataset, and against the latter two attack types on the MultiNLI dataset. If this paper is accepted, we will publish our code.
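Based only on the abstract's description of MDRE (distance features from multiple representation models, combined by an ensemble detector), the sketch below illustrates one way such a detector could be assembled. The specific sentence encoders, the nearest-neighbour Euclidean distance, and the logistic-regression ensemble are assumptions for illustration, not the authors' exact design.

```python
# Hypothetical sketch of a multi-distance representation ensemble detector,
# assuming sentence-transformers encoders, nearest-neighbour Euclidean distance
# to clean training examples, and a logistic-regression ensemble over the
# per-encoder distance features. Not the paper's exact method.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression


class MultiDistanceDetector:
    def __init__(self, encoder_names=("all-MiniLM-L6-v2", "all-mpnet-base-v2")):
        self.encoders = [SentenceTransformer(name) for name in encoder_names]
        self.clf = LogisticRegression()
        self.train_reps = None  # per-encoder representations of clean training text

    def _distances(self, texts):
        # One feature per encoder: distance from each input to its nearest
        # clean training example in that encoder's representation space.
        feats = []
        for enc, ref in zip(self.encoders, self.train_reps):
            reps = enc.encode(texts, convert_to_numpy=True)
            d = np.linalg.norm(reps[:, None, :] - ref[None, :, :], axis=-1).min(axis=1)
            feats.append(d)
        return np.stack(feats, axis=1)

    def fit(self, clean_texts, detect_texts, detect_labels):
        # detect_labels: 1 for adversarial examples, 0 for benign ones.
        self.train_reps = [enc.encode(clean_texts, convert_to_numpy=True)
                           for enc in self.encoders]
        self.clf.fit(self._distances(detect_texts), detect_labels)

    def predict(self, texts):
        # Returns 1 for inputs flagged as adversarial, 0 otherwise.
        return self.clf.predict(self._distances(texts))
```

The design choice this sketch relies on is that adversarial inputs tend to sit farther from the clean data manifold than benign inputs in at least one representation space, so distances from several independent encoders give the ensemble complementary detection signals.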