Generating Natural Language Adversarial Examples

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang

2018 (modified: 16 Jul 2019), EMNLP 2018

Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples: perturbations to correctly classified examples that can cause the network to misclassify them. In the image domain, these perturbations can often be made virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. In the natural language domain, however, small perturbations are clearly perceptible, and replacing a single word can drastically alter the semantics of the document. Given these challenges, we use a population-based optimization algorithm to generate semantically and syntactically similar adversarial examples. We demonstrate via a human study that 94.3% of the generated examples are assigned their original label by human evaluators, and that the examples are perceptibly quite similar to the originals. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.
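
The abstract does not spell out the population-based optimization algorithm, so the following is only a toy sketch of how such a search could work in general, not the authors' implementation. Everything named here is an assumption made for illustration: the SYNONYMS table, the WEIGHTS-based stand-in classifier, and the functions perturb, crossover, and attack. The search queries only the model's output score, keeps a population of candidate sentences, and evolves it through selection, crossover, and synonym substitution until the predicted label flips.

```python
import random

# Hypothetical synonym table; a real attack would need a principled source
# of semantically close replacements.
SYNONYMS = {
    "loved": ["liked", "enjoyed"],
    "great": ["fine", "decent"],
}

# Toy stand-in for the victim model: a bag-of-words sentiment score.
WEIGHTS = {
    "loved": 2.0, "great": 2.0, "liked": 0.5, "enjoyed": 0.8,
    "fine": 0.2, "decent": 0.1, "boring": -1.5,
}

def positive_score(tokens):
    """Model output used as the only feedback signal (black-box access)."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

def predict(tokens):
    return "positive" if positive_score(tokens) > 0 else "negative"

def perturb(tokens, original, rng):
    """Swap one replaceable word for a synonym of the ORIGINAL word there."""
    positions = [i for i, w in enumerate(original) if w in SYNONYMS]
    child = list(tokens)
    if positions:
        i = rng.choice(positions)
        child[i] = rng.choice(SYNONYMS[original[i]])
    return child

def crossover(a, b, rng):
    """Build a child by taking each word from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def attack(original, pop_size=8, generations=50, seed=0):
    """Search for a synonym-substituted sentence classified as negative."""
    rng = random.Random(seed)

    def fitness(tokens):
        # Higher is better: a lower positive score is closer to the flip.
        return -positive_score(tokens)

    population = [perturb(original, original, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best = population[0]
        if predict(best) == "negative":
            return best
        parents = population[: pop_size // 2]
        children = [best]  # elitism: keep the current best candidate
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(perturb(crossover(a, b, rng), original, rng))
        population = children
    return None  # no adversarial example found within the budget

original = "i loved this great but boring film".split()
adversarial = attack(original)
print(" ".join(original), "->", predict(original))
if adversarial is not None:
    print(" ".join(adversarial), "->", predict(adversarial))
```

In this toy setup a single substitution is not enough to flip the label, so the search has to combine changes from different candidates, which is the role crossover plays. Restricting replacements to synonyms of the original word at each position is one simple way to keep candidates close to the input; it says nothing about how the paper itself enforces semantic and syntactic similarity.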
