OneWORD: Adversarial Text Detection and Prediction Restoration Using One-Word Perturbation

Published: 2024 · Last Modified: 11 Feb 2026 · ICONIP (9) 2024 · CC BY-SA 4.0
Abstract: Although deep learning models achieve superior performance on original text, they are sensitive to adversarial attacks. Such attacks deceive models by generating adversarial text with imperceptible changes that preserve the original text’s meaning. Current detection methods are effective against specific adversarial attacks but often fail when faced with unconventional adversarial texts from other attack types. We introduce \(\textrm{OneWORD}\), a novel method designed to detect a wide range of adversarial texts. \(\textrm{OneWORD}\) perturbs a single word in the input text and monitors changes in the prediction labels of the perturbed text. This method not only detects adversarial texts but also efficiently restores their prediction labels. Experimental results across diverse attacks, models, and datasets show that \(\textrm{OneWORD}\) surpasses existing methods in both detecting and restoring predictions for adversarial texts generated by various attack strategies.
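The mechanics described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's actual algorithm: the `detect_and_restore` function, the `[UNK]` masking strategy, and the toy classifier are all assumptions made here to show the general idea of perturbing one word at a time and watching for a prediction-label flip.

```python
def detect_and_restore(text, classify, mask="[UNK]"):
    """Perturb one word at a time; if the predicted label flips for
    some single-word change, flag the text as adversarial and return
    the flipped label as the restored prediction.

    Hypothetical sketch: the paper's actual perturbation choice and
    decision rule may differ.
    """
    words = text.split()
    original = classify(text)
    for i in range(len(words)):
        # Replace the i-th word with a mask token and re-classify.
        perturbed = " ".join(words[:i] + [mask] + words[i + 1:])
        if classify(perturbed) != original:
            return True, classify(perturbed)  # adversarial; restored label
    return False, original  # label is stable under all one-word changes


# Toy stand-in classifier (assumption): "pos" iff the word "good" appears.
toy = lambda t: "pos" if "good" in t.split() else "neg"

print(detect_and_restore("the film was good overall", toy))
print(detect_and_restore("the film was bad", toy))
```

With the toy classifier, masking the single decisive word "good" flips the label, so the first input is flagged and its flipped label is returned; the second input's label is unchanged by every one-word perturbation, so it passes as clean.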