Abstract: Despite significant improvements in natural language understanding brought by models such as BERT and XLNet, these neural-network-based classifiers remain vulnerable to black-box adversarial attacks, in which the attacker is only allowed to query the target model's outputs. We add two further, more realistic restrictions on attack methods: a limit on the number of queries allowed (query budget) and the requirement that crafted attacks transfer easily across different pre-trained models (transferability). These restrictions render previous attack methods impractical and ineffective. Here, we propose a target-model-agnostic adversarial attack method that achieves a high degree of attack transferability across the attacked models. Our empirical studies show that, in comparison to baseline methods, our method generates highly transferable adversarial sentences under limited query budgets.
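To make the attack setting concrete, the sketch below shows a generic query-budgeted black-box attack loop: the attacker repeatedly perturbs an input sentence and queries the target model's output, stopping once the budget is exhausted. This is an illustrative assumption of the setting only, not the paper's method; the names `black_box_attack`, `perturb_sentence`, and the budget value are hypothetical placeholders.

```python
# Minimal sketch of a query-budgeted black-box attack (illustrative only;
# not the paper's actual algorithm). The attacker sees only model outputs.
QUERY_BUDGET = 100  # hypothetical cap on queries to the target model


def black_box_attack(sentence, true_label, target_model, perturb_sentence):
    """Search for an adversarial sentence using only the target model's
    predictions, stopping once the query budget is exhausted."""
    queries = 0
    candidate = sentence
    while queries < QUERY_BUDGET:
        candidate = perturb_sentence(candidate)  # e.g., a synonym swap
        predicted = target_model(candidate)      # one query consumed
        queries += 1
        if predicted != true_label:              # misclassification found
            return candidate, queries
    return None, queries                         # budget exhausted, attack failed


# Example usage with toy stand-ins for the model and the perturbation:
# adv, used = black_box_attack("the movie was great", 1,
#                              target_model=lambda s: 1,
#                              perturb_sentence=lambda s: s + "!")
```

Under the transferability restriction described above, an adversarial sentence found this way would additionally be evaluated against other pre-trained models it was not crafted on.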