TextDefense: Adversarial Text Detection Based on Word Importance Score Dispersion

Published: 2025, Last Modified: 21 Jan 2026
Venue: IEEE Trans. Dependable Secur. Comput. 2025
License: CC BY-SA 4.0
Abstract: Natural language processing (NLP) models are widely used in various scenarios, yet they are vulnerable to adversarial attacks. Existing defenses aim to mitigate this vulnerability, but each either targets a specific attack category or is limited by computational overhead, which leaves it vulnerable to adaptive attacks. In this paper, we exhaustively investigate adversarial attack algorithms in NLP and find that existing attacks mainly disrupt the importance distribution of the words in a text. A well-trained model can distinguish subtle differences in importance distribution between clean and adversarial texts. Based on this intuition, we propose TextDefense, a new adversarial example detection framework that uses the target model’s own capability to defend against adversarial attacks and requires no prior knowledge. Unlike previous approaches, TextDefense is attack-type agnostic, and it outperforms existing methods in experiments across different architectures, datasets, and attack methods. We also find that the target model’s generalizability is a leading factor in the performance of TextDefense. Finally, we provide insights into adversarial attacks in NLP and the principles of our defense method by analyzing the properties of the target model and of adversarial examples.
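The abstract does not specify how word importance or its dispersion is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes leave-one-out importance (the drop in the predicted class's confidence when a word is removed) and uses the population standard deviation as the dispersion measure. The names `word_importance`, `dispersion`, `looks_adversarial`, and `toy_predict_proba` are hypothetical, as are the threshold and the direction of the comparison.

```python
import math
from typing import Callable, List

# Hypothetical model interface: a list of words in, class probabilities out.
PredictFn = Callable[[List[str]], List[float]]


def word_importance(words: List[str], predict_proba: PredictFn) -> List[float]:
    """Leave-one-out importance (an assumed scoring rule, not from the paper).

    The score of a word is the drop in the predicted class's confidence
    when that word is removed from the text.
    """
    base = predict_proba(words)
    label = max(range(len(base)), key=base.__getitem__)
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        scores.append(base[label] - predict_proba(reduced)[label])
    return scores


def dispersion(scores: List[float]) -> float:
    """Population standard deviation of the scores (an assumed measure)."""
    if not scores:
        return 0.0
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))


def looks_adversarial(words: List[str], predict_proba: PredictFn,
                      threshold: float) -> bool:
    """Flag a text whose importance dispersion exceeds a threshold.

    Both the threshold and the direction of the comparison are assumptions;
    in practice they would have to be calibrated on held-out clean texts.
    """
    return dispersion(word_importance(words, predict_proba)) > threshold


def toy_predict_proba(words: List[str]) -> List[float]:
    """Toy stand-in for the target model: a tiny sentiment lexicon."""
    weights = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}
    z = sum(weights.get(w, 0.0) for w in words)
    p_pos = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p_pos, p_pos]


if __name__ == "__main__":
    text = "the movie was great".split()
    scores = word_importance(text, toy_predict_proba)
    print(scores, dispersion(scores))
```

With a real classifier, `toy_predict_proba` would be replaced by a wrapper around the target model, which matches the abstract's point that the detector relies on the target model itself rather than on knowledge of the attack.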