Abstract: Adversarial samples help expose vulnerabilities in neural network models, improve model robustness, and explain their working mechanisms. However, the adversarial texts generated by existing word substitution-based methods are confined to a rigid one-to-one substitution pattern, which is inflexible and limits the attack space. In this paper, we propose ValCAT, a black-box attack framework that misleads the language model by applying variable-length contextualized transformations to the original text. Experiments show that our method outperforms state-of-the-art methods on several classification and inference tasks. Comprehensive human evaluations further demonstrate that ValCAT has a significant advantage in producing fluent adversarial samples and achieves better semantic consistency. We release our code at https://github.com/linerxliner/ValCAT.
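To make the idea of a variable-length contextualized transformation concrete, the sketch below is an illustrative assumption rather than the authors' ValCAT implementation: it masks a multi-word span and lets a pretrained span-infilling language model (here, T5 from Hugging Face Transformers) propose context-aware replacement spans whose length may differ from the original, which is the kind of flexibility a one-to-one word substitution cannot offer.

```python
# Illustrative sketch only; NOT the authors' ValCAT code.
# Idea: mask a multi-word span and let a pretrained seq2seq LM propose
# replacement spans of varying length, conditioned on the surrounding context.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def propose_span_replacements(words, start, length, num_candidates=5):
    """Mask words[start:start+length] and ask T5 to infill the span."""
    masked = words[:start] + ["<extra_id_0>"] + words[start + length:]
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        max_new_tokens=10,
    )
    candidates = []
    for out in outputs:
        text = tokenizer.decode(out, skip_special_tokens=False)
        # Keep only the text generated for the first sentinel span.
        span = text.split("<extra_id_0>")[-1].split("<extra_id_1>")[0]
        candidates.append(span.strip())
    return candidates

# Example: replace the two-word span "quite good" with candidate spans
# that may be shorter or longer than two words.
words = "The movie was quite good overall .".split()
print(propose_span_replacements(words, start=3, length=2))
```

In a full attack, each candidate span would be scored against the victim model (e.g., by the drop in the true-label probability) and filtered for semantic similarity; this sketch only shows the span-infilling step.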
Paper Type: long