NUAT-GAN: Generating Black-Box Natural Universal Adversarial Triggers for Text Classifiers Using Generative Adversarial Networks

Published: 01 Jan 2024 · Last Modified: 13 May 2025 · IEEE Trans. Inf. Forensics Secur. 2024 · CC BY-SA 4.0
Abstract: Recent works have demonstrated that text classifiers are vulnerable to universal adversarial triggers (UATs), which can be concatenated to any original text from the dataset to mislead a text classifier. Existing methods for generating UATs are limited to the white-box setting, where the adversary needs access to gradient information from the target model. In the more practical black-box setting, the adversary can only access the logit output of the target model, which makes crafting UATs harder. In this paper, we propose a framework for generating natural UATs using generative adversarial networks (NUAT-GAN) in the black-box setting. To update the generator's parameters in the black-box setting, we design a training algorithm with policy gradient (TGPG), in which the gradient of the target model is replaced with the policy gradient of reinforcement learning. On three text classification datasets, we evaluate the attack performance and naturalness of UATs generated against LSTM, CNN, and BERT models. Results show that UATs generated by NUAT-GAN mislead all of these models: the average Attack Success Rate (ASR) and GPT-2 Loss of the UATs are 0.81 and 9.10, respectively. The UATs also effectively attack online models such as AllenNLP and ChatGPT. Moreover, when we replace the reward given by the discriminator with GPT-2 Loss, the attack performance and naturalness of the UATs remain close to those of NUAT-GAN, showing that NUAT-GAN is extensible and can be combined with a language model.
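The core idea of TGPG, as the abstract describes it, is that the unavailable gradient of the black-box target model is replaced by a REINFORCE-style policy gradient: the generator samples a trigger, the target model's logits (plus a naturalness signal) produce a scalar reward, and that reward weights the log-probability gradient of the sampled tokens. The following is a minimal toy sketch of that loop, not the paper's implementation; the vocabulary, the reward function, and all hyperparameters are hypothetical stand-ins for the real classifier logits and discriminator score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and trigger length (hypothetical, for illustration only).
VOCAB = ["the", "zoning", "tapping", "movie", "worst", "great"]
V, TRIGGER_LEN = len(VOCAB), 3
REWARD_WORDS = {"worst", "zoning"}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def black_box_reward(trigger):
    """Stand-in for the black-box signal. In NUAT-GAN this would combine
    the drop in the target classifier's logit for the true class with the
    discriminator's naturalness reward; here it is a toy score."""
    return sum(w in REWARD_WORDS for w in trigger) / TRIGGER_LEN

# Generator: one categorical distribution per trigger position.
logits = np.zeros((TRIGGER_LEN, V))
lr = 0.5

for step in range(500):
    # Sample a trigger from the generator's current policy.
    probs = [softmax(logits[i]) for i in range(TRIGGER_LEN)]
    idx = [rng.choice(V, p=p) for p in probs]
    trigger = [VOCAB[i] for i in idx]
    r = black_box_reward(trigger)  # scalar reward, no target gradients used
    # REINFORCE update: reward * grad log pi(token) replaces the
    # (unavailable) gradient of the target model.
    for pos, a in enumerate(idx):
        grad = -probs[pos]
        grad[a] += 1.0
        logits[pos] += lr * r * grad

best_trigger = [VOCAB[int(np.argmax(logits[i]))] for i in range(TRIGGER_LEN)]
reward_mass = sum(softmax(logits[i])[VOCAB.index(w)]
                  for i in range(TRIGGER_LEN) for w in REWARD_WORDS)
print(best_trigger, reward_mass)
```

After training, the policy's probability mass concentrates on the tokens the toy reward favors, mirroring how TGPG steers the generator toward triggers that lower the target model's confidence using only its logit outputs.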