Abstract: Generated text detectors can effectively detect machine-generated texts that spread false information and undermine the credibility of media platforms. However, these detectors are vulnerable to adversarial example attacks that rely on character-level perturbations, which introduce many word errors. In this paper, we design Sentence-Keyword-Attack (SK-Attack), a black-box attack model operating at sentence granularity that effectively generates semantics-preserving, fluent, and grammatical adversarial examples. SK-Attack adaptively truncates the input examples at sentence granularity and searches for the essential sentences, to which it applies a sequence of contextualized perturbations under strict constraints. SK-Attack also applies keyword protection to keep the keywords from being perturbed. Experiments show that SK-Attack outperforms the baselines when attacking the RoBERTa detector on various challenging generated text datasets and also has strong transferability to other attack models.