Adversarial Examples for Chinese Text Classification

DSC 2020 (modified: 14 Jun 2021)
Abstract: Deep neural networks (DNNs) have been widely adopted in areas such as image recognition and natural language processing. However, many works have shown that DNNs for image classification are vulnerable to adversarial examples, which are generated by adding small-magnitude perturbations to the original inputs. In this paper, we show that DNNs for Chinese text classification are also vulnerable to adversarial examples. We propose a marginal attack method to generate adversarial examples that fool such DNNs. The method uses the Naïve Bayes principle to select sensitive words and appends only a small number of them to the end of the original text. The generated adversarial examples can fool a variety of Chinese text classification DNNs, causing the text to be classified into an incorrect category with high probability. We conduct extensive experiments to evaluate the attack performance, and the results show that the success rate of the attacks can reach almost 100% by adding only five sensitive words.
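The abstract only outlines the attack at a high level: score words with a Naïve Bayes model, pick those most indicative of a wrong class, and append a handful of them to the input. The sketch below illustrates that idea under several assumptions not stated in the paper (word-level segmentation, Laplace smoothing, a log-probability-ratio ranking, and the helper names such as `select_sensitive_words` are all illustrative); it is not the authors' implementation.

```python
# Minimal sketch of the marginal-attack idea: pick "sensitive" words via
# Naive Bayes statistics and append them to the end of the original text.
# All concrete details (tokenisation, smoothing, toy data) are assumptions.
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate per-class word log-probabilities with Laplace smoothing."""
    counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        counts[label].update(words)
    vocab = {w for counter in counts.values() for w in counter}
    log_probs = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + len(vocab)
        log_probs[label] = {w: math.log((counter[w] + 1) / total) for w in vocab}
    return log_probs


def select_sensitive_words(log_probs, target_class, true_class, k=5):
    """Rank words by how strongly the Naive Bayes model associates them with
    the target (wrong) class relative to the true class; keep the top k."""
    scores = {
        w: log_probs[target_class][w] - log_probs[true_class][w]
        for w in log_probs[target_class]
    }
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]


def marginal_attack(text_words, sensitive_words):
    """Append the selected sensitive words to the end of the original text."""
    return text_words + sensitive_words


if __name__ == "__main__":
    # Toy word-segmented "documents" standing in for Chinese text.
    docs = [["股市", "上涨", "投资"], ["球队", "夺冠", "比赛"]]
    labels = ["finance", "sports"]
    log_probs = train_naive_bayes(docs, labels)
    words = select_sensitive_words(log_probs, target_class="sports",
                                   true_class="finance", k=2)
    print(marginal_attack(["股市", "大跌"], words))  # finance text + sports words
```

Since the appended words only extend the tail of the input, the original text stays readable while the classifier's decision is pushed toward the target class; the abstract reports near-100% attack success with as few as five such words.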