Reinforce Attack: Adversarial Attack against BERT with Reinforcement Learning

Anonymous

17 Aug 2021 (modified: 05 May 2023) · ACL ARR 2021 August Blind Submission · Readers: Everyone

Abstract: Adversarial attacks against textual data have been drawing increasing attention in both the NLP and security domains. Current successful attack methods for text typically consist of two stages: word importance ranking and word replacement. The first stage is usually achieved by masking each word in the sentence one at a time and recording the resulting output probability of the target model. The second stage involves finding synonyms to replace “vulnerable” words in order of their ranking. In this paper, we first explore the effects of employing the model explanation tool LIME to generate the word importance ranking, which has the advantage of taking the local information around each word into account when computing importance scores. We then propose Reinforce Attack, a reinforcement learning (RL) based framework for generating adversarial text. Notably, the attack process is controlled by a reward function, rather than by heuristics as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our LIME + Reinforce Attack method achieves an attack success rate better than or comparable to that of other state-of-the-art attack frameworks, while the generated samples preserve significantly higher semantic similarity.
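
The masking-based ranking stage described in the abstract can be illustrated with a minimal sketch. It covers only the conventional baseline (mask one word at a time and measure the drop in the target model's output probability), not the LIME-based ranking or the Reinforce Attack framework itself. The names `rank_words_by_masking` and `toy_predict_proba` are hypothetical, and the toy scorer is an assumed stand-in for a real target model such as BERT.

```python
from typing import Callable, List, Tuple


def rank_words_by_masking(
    words: List[str],
    predict_proba: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[Tuple[int, str, float]]:
    """Rank words by how much masking each one lowers the model's probability.

    `predict_proba` is an assumed interface: it takes a token list and returns
    the probability the target model assigns to the original label.
    """
    base = predict_proba(words)
    scores = []
    for i, word in enumerate(words):
        # Replace exactly one word with the mask token and re-query the model.
        masked = words[:i] + [mask_token] + words[i + 1:]
        drop = base - predict_proba(masked)
        scores.append((i, word, drop))
    # Largest probability drop first: these are the most "vulnerable" words.
    return sorted(scores, key=lambda item: item[2], reverse=True)


def toy_predict_proba(words: List[str]) -> float:
    """Hypothetical stand-in for a sentiment classifier (not a real model)."""
    positive = {"great": 0.4, "good": 0.2}
    return min(1.0, 0.1 + sum(positive.get(w, 0.0) for w in words))


ranking = rank_words_by_masking(
    "the movie was great and good".split(), toy_predict_proba
)
for index, word, drop in ranking:
    print(f"{index}\t{word}\t{drop:.2f}")
```

Each call to `predict_proba` counts as one query to the target model, so this baseline costs one query per word plus one for the unmasked sentence. That per-word cost is the kind of query budget the abstract's reward function is meant to reduce.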
