TL;DR: We present a framework for evaluating adversarial examples in natural language processing and demonstrate that generated adversarial examples are often not semantics-preserving, syntactically correct, or non-suspicious.
Abstract: Attacks on natural language models are difficult to compare due to their different definitions of what constitutes a successful attack. We present a taxonomy of constraints to categorize these attacks. For each constraint, we present a real-world use case and a way to measure how well generated samples enforce the constraint. We then employ our framework to evaluate two state-of-the art attacks which fool models with synonym substitution. These attacks claim their adversarial perturbations preserve the semantics and syntactical correctness of the inputs, but our analysis shows these constraints are not strongly enforced. For a significant portion of these adversarial examples, a grammar checker detects an increase in errors. Additionally, human studies indicate that many of these adversarial examples diverge in semantic meaning from the input or do not appear to be human-written. Finally, we highlight the need for standardized evaluation of attacks that share constraints. Without shared evaluation metrics, it is up to researchers to set thresholds that determine the trade-off between attack quality and attack success. We recommend well-designed human studies to determine the best threshold to approximate human judgement.
Keywords: adversarial examples, natural language processing, analysis
Original Pdf: pdf