Fine-tuning more stable neural text classifiers for defending word level adversarial attacks

Published: 01 Jan 2022 · Last Modified: 05 Jun 2025 · Appl. Intell. 2022 · CC BY-SA 4.0
Abstract: Text adversarial attacks are a serious problem for natural language processing applications. Neural text classifiers can be misled by perturbed examples in which only a few characters or words have been modified. Defending against word-level adversarial attacks is especially challenging because the adversarial examples are correct in spelling, grammar, and semantics. Current defense methods suffer from two main problems: they usually reduce the accuracy of the classifier, and their defensive effect cannot be guaranteed. We propose StaFF, a Stability Fine-tuning Framework, to defend against word-level adversarial attacks while maintaining classification accuracy on clean examples. Within the framework, we introduce stability, quantified as the change in the output probability distribution caused by small perturbations. We then fine-tune the classifier with a new optimization objective that ensures both accuracy and stability. Extensive experiments show that a classifier enhanced by StaFF does not lose classification accuracy and may even improve it. With StaFF, word-level adversarial attacks rarely succeed against the protected classifiers. Moreover, a classifier trained with StaFF correctly classifies most adversarial examples, and its accuracy outperforms existing word-level defense baselines.
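The abstract does not give the exact objective, but a minimal sketch of the idea, assuming the stability term is measured as the KL divergence between the classifier's output distributions on a clean input and a perturbed copy, and that a hypothetical weight `lam` balances the two terms, might look like this:

```python
# Minimal sketch of a stability-regularized fine-tuning objective.
# Assumption: "stability" is approximated here as the KL divergence between
# the classifier's output distributions on a clean example and a slightly
# perturbed copy; StaFF's exact formulation may differ.
import torch
import torch.nn.functional as F

def stability_finetune_loss(model, clean_ids, perturbed_ids, labels, lam=1.0):
    """Combine an accuracy term (cross-entropy on clean inputs) with a
    stability term (KL between clean and perturbed predictions)."""
    clean_logits = model(clean_ids)          # (batch, num_classes)
    perturbed_logits = model(perturbed_ids)  # same shape

    # Accuracy term: standard cross-entropy on the clean examples.
    ce = F.cross_entropy(clean_logits, labels)

    # Stability term: penalize changes in the predicted distribution
    # caused by the small word-level perturbation.
    kl = F.kl_div(
        F.log_softmax(perturbed_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )
    return ce + lam * kl
```

In this sketch, minimizing the combined loss pushes the classifier to keep its predicted distribution nearly unchanged under small word substitutions while still fitting the clean labels, which is the trade-off the abstract describes.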