Abstract: Despite achieving excellent benchmark performance, state-of-the-art NLP models can still be easily fooled by adversarial perturbations such as typos. Previous heuristic defenses cannot guard against the exponentially large number of possible perturbations, and previous certified defenses only work with limited model sizes and simple architectures. In this paper, we construct task-agnostic robust encodings (TARE): sentence representations that improve the robustness of any model for multiple downstream tasks at once, and enable efficient exact computation of robust accuracy (accuracy on worst-case perturbations) for a fixed family of perturbations. The core idea behind TARE is to map sentences through a discrete bottleneck before feeding them to a downstream model. To create robust encodings, we must optimize for two competing goals: the encoding of a sentence must retain enough information about the sentence, but should also map all perturbations of the sentence to the same encoding to ensure invariance to perturbations. Averaged across six tasks from GLUE, a standard suite of NLP tasks, the same encoding leads to robust accuracy of 71.2% when defending against a large family of typos, while a strong baseline that uses a typo corrector achieves only 38.5% accuracy, and training on random typos achieves only 9.9% accuracy.
Keywords: Natural language processing, adversarial examples, robustness
Original Pdf: pdf
8 Replies
Loading