AdvCodec: Towards A Unified Framework for Adversarial Text Generation

Anonymous

Sep 25, 2019 ICLR 2020 Conference Blind Submission readers: everyone Show Bibtex
  • TL;DR: we propose a novel framework AdvCodec to generate adversarial text agaist general NLP tasks based on tree-autoencoder, and we show that AdvCodec outperforms other baselines and achieves high performance in human evaluation.
  • Abstract: Machine learning (ML) especially deep neural networks (DNNs) have been widely applied to real-world applications. However, recent studies show that DNNs are vulnerable to carefully crafted \emph{adversarial examples} which only deviate from the original data by a small magnitude of perturbation. While there has been great interest on generating imperceptible adversarial examples in continuous data domain (e.g. image and audio) to explore the model vulnerabilities, generating \emph{adversarial text} in the discrete domain is still challenging. The main contribution of this paper is to propose a general targeted attack framework \advcodec for adversarial text generation which addresses the challenge of discrete input space and be easily adapted to general natural language processing (NLP) tasks. In particular, we propose a tree based autoencoder to encode discrete text data into continuous vector space, upon which we optimize the adversarial perturbation. With the tree based decoder, it is possible to ensure the grammar correctness of the generated text; and the tree based encoder enables flexibility of making manipulations on different levels of text, such as sentence (\advcodecsent) and word (\advcodecword) levels. We consider multiple attacking scenarios, including appending an adversarial sentence or adding unnoticeable words to a given paragraph, to achieve arbitrary \emph{targeted attack}. To demonstrate the effectiveness of the proposed method, we consider two most representative NLP tasks: sentiment analysis and question answering (QA). Extensive experimental results show that \advcodec has successfully attacked both tasks. In particular, our attack causes a BERT-based sentiment classifier accuracy to drop from $0.703$ to $0.006$, and a BERT-based QA model's F1 score to drop from $88.62$ to $33.21$ (with best targeted attack F1 score as $46.54$). Furthermore, we show that the white-box generated adversarial texts can transfer across other black-box models, shedding light on an effective way to examine the robustness of existing NLP models.
  • Keywords: adversarial text generation, tree-autoencoder, human evaluation
0 Replies

Loading