Abstract: Generative large language models (LLMs) have demonstrated outstanding performance across a wide range of language-related tasks. However, these models may inherit or even amplify biases in their training data, potentially leading to adverse impacts on specific individuals or groups. To this end, many recent studies have been devoted to mitigating the biases of generative LLMs, with considerable success. Given the potential for significant harm, it is essential to ensure that generative LLMs produce consistently fair outputs even under adversarial influence. Nevertheless, the issue of bias in generative LLMs has not been studied from an adversarial attack perspective. Moreover, existing research on adversarial attacks against generative LLMs focuses mainly on English texts, with far less attention to Chinese texts. In this paper, to bridge this gap, we first construct a Chinese bias dataset and use it to train a Chinese bias classifier. Furthermore, we propose a Polarity-Based Bias Adversarial Attack (PBAA) that induces generative LLMs to produce biased responses. Specifically, we design an adaptive beam search mechanism to generate candidate adversarial prompts that remain fluent and semantically preserved. We then develop a bias-oriented selection strategy to select the adversarial prompts that maximize model bias. We evaluate PBAA on various open-source and closed-source models in both Chinese and English scenarios. The results show that PBAA not only succeeds in inducing models to produce biased responses but also exhibits strong transferability. In addition, PBAA remains effective against adversarial defense strategies.
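The abstract describes the method only at a high level: a beam search over perturbed prompts, scored by a bias classifier on the victim model's responses. As a rough illustration only, the sketch below shows what such an attack loop could look like; it is not the authors' algorithm, and every name here (pbaa_beam_search, generate, score_bias, perturb, beam_width) is a hypothetical stand-in.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: in the paper these would be the trained Chinese
# bias classifier and the victim LLM; here they are stand-in callables.
BiasScorer = Callable[[str], float]      # response -> bias score in [0, 1]
Generator = Callable[[str], str]         # victim LLM: prompt -> response
Perturber = Callable[[str], List[str]]   # fluent, meaning-preserving prompt edits

def pbaa_beam_search(
    seed_prompt: str,
    generate: Generator,
    score_bias: BiasScorer,
    perturb: Perturber,
    beam_width: int = 4,
    max_steps: int = 5,
) -> Tuple[str, float]:
    """Minimal sketch of a PBAA-style loop (assumed structure): keep a beam
    of candidate adversarial prompts and expand those whose responses the
    classifier scores as most biased."""
    beam: List[Tuple[str, float]] = [
        (seed_prompt, score_bias(generate(seed_prompt)))
    ]
    for _ in range(max_steps):
        candidates = list(beam)
        for prompt, _ in beam:
            for variant in perturb(prompt):
                # Bias-oriented selection: query the victim model and score
                # its response with the bias classifier.
                candidates.append((variant, score_bias(generate(variant))))
        # Retain the beam_width prompts eliciting the most biased responses.
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]
```

In this reading, the perturber supplies the "smooth and semantically preserved" candidates and the classifier-driven ranking plays the role of the bias-oriented selection strategy; the adaptive aspect of the paper's beam search is not captured by this fixed-width sketch.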
DOI: 10.1109/taslpro.2025.3645995