Sentences Based Adversarial Attack on AI-Generated Text Detectors

Rongxin Tu, Xiangui Kang, Chee Wei Tan, Chi-Hung Chi, Kwok-Yan Lam

Published: 2026, Last Modified: 12 Mar 2026. IEEE Trans. Big Data 2026. CC BY-SA 4.0

Abstract: The widespread use of AI-generated text has introduced significant security concerns, driving the need for reliable detection systems. However, recent studies reveal that neural network-based detectors are vulnerable to adversarial examples. To improve the robustness of such classifiers, a number of adversarial attack strategies have been developed, particularly in the context of text sentiment classification. Most existing adversarial attack methods focus on the semantics of individual words or sentences and often neglect the broader contextual semantics of the entire text, especially when the AI-generated text is long. This limitation frequently results in adversarial examples that lack fluency and coherence. In this paper, we propose a novel method called Sentence-based Adversarial attack on AI-Generated Text detectors (SAGT), which generates linguistically fluent adversarial examples by inserting model-generated sentences into the original text. To ensure contextual semantic consistency, we extract important keywords from the original text, selected based on changes in the detector's confidence score, and incorporate them into the generated sentences. Extensive experimental results demonstrate that adversarial examples crafted by SAGT can effectively evade AI-generated text detectors.
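
The keyword-selection and sentence-insertion steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_detector`, `MARKERS`, `rank_keywords`, `template_sentence`, and `insert_sentence` are hypothetical names, the detector is a word-counting stand-in for a trained classifier, keyword importance is assumed to be the absolute change in confidence when one word is deleted, and a fixed template stands in for a model-generated sentence.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical marker words; a real detector would be a trained classifier.
MARKERS = {"moreover", "furthermore", "overall", "delve"}


def normalize(word: str) -> str:
    """Lowercase a token and strip surrounding punctuation."""
    return word.strip(".,;:!?").lower()


def toy_detector(text: str) -> float:
    """Stand-in detector: fraction of words that are marker words."""
    words = [normalize(w) for w in text.split()]
    if not words:
        return 0.0
    return sum(w in MARKERS for w in words) / len(words)


def rank_keywords(
    text: str, detector: Callable[[str], float], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Score each word by how much deleting it changes the detector's
    confidence (an assumed importance measure), and return the top_k."""
    words = text.split()
    base = detector(text)
    best: Dict[str, float] = {}
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        change = abs(base - detector(ablated))
        key = normalize(word)
        if key and change > best.get(key, -1.0):
            best[key] = change
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


def template_sentence(keywords: List[str]) -> str:
    """Placeholder for a model-generated sentence containing the keywords."""
    return "This point also concerns " + " and ".join(keywords) + "."


def insert_sentence(sentences: List[str], new_sentence: str, index: int) -> str:
    """Insert new_sentence before position index and rejoin the text."""
    return " ".join(sentences[:index] + [new_sentence] + sentences[index:])


if __name__ == "__main__":
    sentences = ["Moreover the results delve into detail.", "The cat sat quietly."]
    keywords = [w for w, _ in rank_keywords(" ".join(sentences), toy_detector)]
    print(insert_sentence(sentences, template_sentence(keywords), 1))
```

In an actual attack, the candidate text would presumably be re-scored by the detector after each insertion, and the template would be replaced by a language model prompted with the selected keywords; both choices are left open here because the abstract does not specify them.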

External IDs: dblp:journals/tbd/TuKTCL26