Abstract: In Natural Language Processing (NLP), Language Models (LMs) are widely applied to tasks such as text classification, machine translation, and knowledge reasoning. However, inherent weaknesses make LMs vulnerable to adversarial attacks, which can cause substantial economic losses. Adversarial examples can effectively expose these vulnerabilities and can be used for adversarial training to improve model robustness. Most existing methods generate adversarial examples by first selecting important tokens and then perturbing them; such methods require a large number of queries to the victim model and are therefore impractical when the query budget is limited. To meet the demand for more query-efficient adversarial example generation, this paper presents CBAPB, a Classification Boundary Adjacent Perturbation and Back-track based textual adversarial attack. CBAPB first introduces coarse-grained perturbations at random positions, preserving the original semantics of the input example until the similarity threshold is reached; it then performs fine-grained perturbation backtracking on every successfully misclassified example to minimize the perturbation magnitude. We conduct experiments on the Yelp Reviews, AG News, and DBpedia datasets with BERT as the victim model. Compared with the baselines, CBAPB requires only 3.2% of their average number of queries while increasing the attack success rate by 7.6%, at the cost of a slight 1.5% decrease in textual similarity. These results demonstrate the effectiveness of CBAPB, which is not only query-efficient but also achieves higher attack success rates.
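The two-phase procedure the abstract describes (coarse random perturbation until misclassification or the similarity floor, then backtracking to undo unnecessary perturbations) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the victim model, the similarity measure, and the `[MASK]` perturbation operator below are toy stand-ins for whatever classifier, semantic-similarity metric, and token-replacement strategy CBAPB actually uses.

```python
import random

def toy_victim(tokens):
    # Toy "classifier": label 1 iff the word "good" appears (stand-in for BERT).
    return 1 if "good" in tokens else 0

def toy_similarity(orig, adv):
    # Fraction of unchanged positions (stand-in for a semantic similarity score).
    return sum(a == b for a, b in zip(orig, adv)) / len(orig)

def two_phase_attack(tokens, victim, similarity, threshold=0.5, seed=0):
    rng = random.Random(seed)
    orig_label = victim(tokens)
    adv = list(tokens)

    # Phase 1: coarse-grained perturbation at random positions, stopping
    # once the example is misclassified; candidates that would push
    # similarity below the threshold are skipped.
    positions = list(range(len(tokens)))
    rng.shuffle(positions)
    for pos in positions:
        if victim(adv) != orig_label:
            break
        candidate = list(adv)
        candidate[pos] = "[MASK]"  # toy perturbation operator
        if similarity(tokens, candidate) < threshold:
            continue
        adv = candidate

    if victim(adv) == orig_label:
        return None  # attack failed within the similarity budget

    # Phase 2: fine-grained backtracking on the misclassified example --
    # restore each perturbed token whenever doing so keeps the
    # misclassification, minimizing the final perturbation magnitude.
    for pos in range(len(tokens)):
        if adv[pos] == tokens[pos]:
            continue
        trial = list(adv)
        trial[pos] = tokens[pos]
        if victim(trial) != orig_label:
            adv = trial
    return adv
```

Note that each phase-2 restoration costs one victim query, so the overall query count stays proportional to the sentence length rather than requiring a per-token importance-ranking pass up front.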
External IDs: dblp:conf/smc/QiaoXXC24