Abstract: The adversarial attack paradigm explores the vulnerability of
deep learning models: minor changes to an input can force a model to fail. Most
state-of-the-art frameworks focus on adversarial attacks for images and other
structured model inputs, not for models of categorical sequences.
Successful attacks on classifiers of categorical sequences are challenging because
the model inputs are tokens from a finite set, so the classifier score is non-differentiable
with respect to the input, and gradient-based attacks are not applicable. Common
approaches address this problem at the token level, but the resulting discrete
optimization problem is expensive to solve.
Instead, we fine-tune a language model to serve as a generator of adversarial
examples. To optimize this generator, we define a differentiable loss function
that depends on a surrogate classifier score and on a deep learning model
that approximates edit distance. In this way, we control both the adversarial strength of a
generated sequence and its similarity to the original sequence.
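A minimal sketch of such a combined loss, assuming a PyTorch setup in which the surrogate classifier and the edit-distance approximator are already trained (all names, tensor shapes, and the trade-off weight alpha are hypothetical illustrations, not the paper's exact formulation):

import torch

def adversarial_generation_loss(surrogate_logits, edit_distance_estimate,
                                original_class, alpha=1.0):
    # surrogate_logits: scores of a differentiable surrogate classifier on the
    #   generated sequences, shape [batch, num_classes] (hypothetical name).
    # edit_distance_estimate: output of a deep model approximating the edit
    #   distance to the original sequences, shape [batch] (hypothetical name).
    # Push the surrogate classifier away from the original class label.
    class_term = torch.log_softmax(surrogate_logits, dim=-1)[
        torch.arange(surrogate_logits.size(0)), original_class
    ]
    # Keep the generated sequence close to the original one.
    similarity_term = edit_distance_estimate
    # Minimizing this trades adversarial strength against similarity;
    # both terms are differentiable, so the generator can be trained end to end.
    return (class_term + alpha * similarity_term).mean()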
As a result, we obtain semantically better samples that are also resistant to
adversarial training and adversarial detectors. Our model works across diverse datasets
of bank transactions, electronic health records, and NLP tasks.