Adversarial Examples for Natural Language Classification Problems

Volodymyr Kuleshov; Shantanu Thakoor; Tingfung Lau; Stefano Ermon

15 Feb 2018 (modified: 27 Jun 2023) | ICLR 2018 Conference Blind Submission | Readers: Everyone

Abstract: Modern machine learning algorithms are often susceptible to adversarial examples: maliciously crafted inputs that are undetectable by humans but that fool the algorithm into producing undesirable behavior. In this work, we show that adversarial examples exist in natural language classification: we formalize the notion of an adversarial example in this setting and describe algorithms that construct such examples. Adversarial perturbations can be crafted for a wide range of tasks, including spam filtering, fake news detection, and sentiment analysis, and they affect different models: convolutional and recurrent neural networks and, to a lesser degree, linear classifiers. Constructing an adversarial example involves replacing 10-30% of words in a sentence with synonyms that don’t change its meaning. Up to 90% of input examples admit adversarial perturbations; furthermore, these perturbations retain a degree of transferability across models. Our findings demonstrate the existence of vulnerabilities in machine learning systems and hint at limitations in our understanding of classification algorithms.
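To make the synonym-replacement idea concrete, here is a minimal sketch of one way such a perturbation could be searched for. It is not the authors' algorithm, which the abstract does not spell out: the greedy search strategy, the toy linear classifier with its `WEIGHTS`, the `SYNONYMS` table, and the `max_fraction` budget are all assumptions invented for illustration. A real attack would also need a check that the substitutions preserve meaning, which this sketch omits.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy resources, invented for illustration only.
WEIGHTS: Dict[str, float] = {
    "great": 2.0, "good": 0.4, "fine": 0.1,
    "tasty": 1.5, "edible": -0.2,
    "slow": -1.0, "bad": -2.0,
}
SYNONYMS: Dict[str, List[str]] = {
    "great": ["good", "fine"],
    "tasty": ["edible"],
}


def score(words: List[str]) -> float:
    """Linear bag-of-words score; a value above 0 means 'positive'."""
    return sum(WEIGHTS.get(w, 0.0) for w in words)


def greedy_attack(
    words: List[str],
    score_fn: Callable[[List[str]], float],
    synonyms: Dict[str, List[str]],
    max_fraction: float = 0.3,
) -> Optional[List[str]]:
    """Greedily swap words for synonyms until the predicted label flips.

    At most max_fraction of the words are replaced. Returns the perturbed
    word list, or None if no label flip is found within that budget.
    """
    original_positive = score_fn(words) > 0
    budget = int(max_fraction * len(words))
    current = list(words)
    changed = set()  # positions already replaced; each is swapped at most once

    for _ in range(budget):
        best = None  # (signed score, position, replacement)
        for i, word in enumerate(current):
            if i in changed:
                continue
            for candidate_word in synonyms.get(word, []):
                candidate = current[:i] + [candidate_word] + current[i + 1:]
                s = score_fn(candidate)
                # Lower is better: push the score toward the opposite label.
                signed = s if original_positive else -s
                if best is None or signed < best[0]:
                    best = (signed, i, candidate_word)
        if best is None:
            return None  # no substitutions left to try
        _, i, replacement = best
        current[i] = replacement
        changed.add(i)
        if (score_fn(current) > 0) != original_positive:
            return current
    return None


if __name__ == "__main__":
    sentence = "the food was great and tasty but the service was slow".split()
    adversarial = greedy_attack(sentence, score, SYNONYMS)
    print(" ".join(adversarial) if adversarial else "no adversarial example found")
```

In this toy setup the search swaps two of the eleven words, which falls inside the 10-30% range the abstract reports; that agreement is a property of the hand-picked weights, not evidence about the paper's results.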

Data: [Yelp Review Polarity](https://paperswithcode.com/dataset/yelp-review-polarity)
