ChatGPT in threefold: As attacker, target, and evaluator in adversarial attacks

Published: 01 Jan 2025 · Last Modified: 17 Apr 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: ChatGPT is one of the most popular large language models (LLMs), achieving state-of-the-art (SOTA) performance on a range of natural language processing (NLP) tasks. Recent studies have shown that deep learning (DL) models are vulnerable to adversarial texts, yet research on the security of ChatGPT remains in its early stages. In this paper, ChatGPT plays three roles in adversarial attacks: attack method, target model, and evaluation tool. We propose ChatGPT Tricker (CGT), a novel black-box word-level adversarial text generation method for text classification. Within a word-importance-based generation framework, CGT uses ChatGPT prompts to introduce imperceptible perturbations into the original text and craft adversarial examples. We then exploit the transferability of these adversarial texts to attack ChatGPT itself and assess its robustness against such attacks. In addition, we introduce a ChatGPT-based fluency evaluation method for adversarial texts, defining a new metric, offset average difference (OAD), that yields a practical formula for computing fluency scores and enables fully automated evaluation without human intervention. We use CGT to attack both BERT and ChatGPT on real-world English and Chinese text classification datasets. Experimental results show that CGT balances the different aspects of attack performance well: it produces more fluent adversarial texts while achieving an attack success rate close to that of SOTA baselines. The results also indicate that ChatGPT resists English adversarial texts but is vulnerable to Chinese ones; on the Chinese dataset, CGT attacks ChatGPT with a success rate exceeding 30%.
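
The abstract describes CGT only at a high level. The sketch below illustrates the generic word-importance attack loop that this family of black-box methods builds on: rank words by how much deleting each one lowers the victim's confidence, then greedily substitute the most important words. The leave-one-out scoring, the greedy loop, and all names (`victim_prob`, `llm_substitute`, the prompt wording in the comment) are our assumptions for illustration, not the paper's exact algorithm; the LLM call is stubbed so the snippet runs as-is.

```python
from typing import Callable, List, Optional

def word_importance(words: List[str],
                    victim_prob: Callable[[str], float]) -> List[int]:
    """Rank positions by how much deleting each word lowers the victim
    model's confidence in the original label (leave-one-out scoring)."""
    base = victim_prob(" ".join(words))
    drops = [base - victim_prob(" ".join(words[:i] + words[i + 1:]))
             for i in range(len(words))]
    return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)

def word_level_attack(text: str,
                      victim_prob: Callable[[str], float],
                      llm_substitute: Callable[[str, str], str],
                      threshold: float = 0.5,
                      max_edits: int = 5) -> Optional[str]:
    """Greedy black-box attack: replace the most important words with
    LLM-proposed substitutes until the victim's confidence in the
    original label falls below `threshold`."""
    words = text.split()
    for idx in word_importance(words, victim_prob)[:max_edits]:
        candidate = words[:]
        # e.g. prompt ChatGPT: "Suggest a fluent synonym for '<word>' in: <text>"
        candidate[idx] = llm_substitute(words[idx], " ".join(words))
        if victim_prob(" ".join(candidate)) < victim_prob(" ".join(words)):
            words = candidate  # keep the perturbation only if it helps
        if victim_prob(" ".join(words)) < threshold:
            return " ".join(words)  # misclassification induced
    return None  # attack failed within the edit budget

# Toy stand-ins: a keyword "classifier" and a fixed substitution table.
victim = lambda s: 0.9 if "terrible" in s else 0.2
subs = {"terrible": "dreadful"}
print(word_level_attack("the movie was terrible", victim,
                        lambda w, ctx: subs.get(w, w)))
```

In a real setting, `victim_prob` would wrap the target classifier's confidence in the original label, and `llm_substitute` would wrap a ChatGPT API call that proposes a context-aware replacement word.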
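
The abstract names OAD but does not give its formula, so the following is not an implementation of OAD; it only sketches the generic ChatGPT-as-judge pattern that such automated, human-free fluency evaluation rests on. `llm_score` is a hypothetical wrapper around a ChatGPT call, and the averaged score difference is a simple degradation measure in the spirit of (but not identical to) the paper's metric.

```python
from typing import Callable, List

def judge_fluency(texts: List[str],
                  llm_score: Callable[[str], float]) -> List[float]:
    """LLM-as-judge fluency rating with no human in the loop.
    `llm_score` wraps a ChatGPT call and returns a numeric rating."""
    prompt = ("Rate the fluency of the following text on a scale of 1 "
              "(broken) to 5 (perfectly fluent). Reply with the number "
              "only.\n\n{}")
    return [llm_score(prompt.format(t)) for t in texts]

def mean_fluency_drop(originals: List[str], adversarials: List[str],
                      llm_score: Callable[[str], float]) -> float:
    """Average per-pair fluency difference between original texts and
    their adversarial counterparts (higher = more fluency lost)."""
    o = judge_fluency(originals, llm_score)
    a = judge_fluency(adversarials, llm_score)
    return sum(oi - ai for oi, ai in zip(o, a)) / len(o)
```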