Abstract: Adversarial training is a well-established methodology for making language models more robust, helping them avoid harmful responses and misclassification. Although adversarial training has achieved empirical success, many existing methods perturb embeddings via query-based adversarial samples, and such embedding-level perturbations differ from the realistic textual adversarial features encountered during training. In this work, we propose UnGAT and MulGAT, two new approaches to adversarial training that produce perturbations as discrete tokens rather than applying them to embedding representations throughout training. Both UnGAT and MulGAT pair a generator that produces adversarial text with a victim model fine-tuned on both original and adversarial text. UnGAT's generator is fine-tuned to fool the victim model without any adversarial dataset, whereas MulGAT transfers adversarial features from source tasks to unseen tasks via a generator fine-tuned on a multi-task adversarial dataset. Experiments on text classification and dialogue generation demonstrate the effectiveness of our approaches over many state-of-the-art baselines.
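The following is a minimal sketch, not the authors' released implementation, of the generator/victim loop the abstract describes: a generator emits adversarial perturbations as discrete tokens (here via a REINFORCE-style update, an assumption, since token sampling is non-differentiable), and the victim is fine-tuned on both original and adversarial text. The toy classes `Generator` and `Victim` and all hyperparameters are hypothetical placeholders.

```python
# Hypothetical sketch of an UnGAT-style training loop (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, CLASSES = 1000, 64, 2

class Victim(nn.Module):
    """Toy text classifier: mean-pooled token embeddings -> linear head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, CLASSES)
    def forward(self, tokens):              # tokens: (B, T) long
        return self.head(self.emb(tokens).mean(dim=1))

class Generator(nn.Module):
    """Toy rewriter: predicts a replacement-token distribution per position."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, VOCAB)
    def forward(self, tokens):
        return self.out(self.emb(tokens))   # (B, T, VOCAB) logits

victim, gen = Victim(), Generator()
opt_v = torch.optim.Adam(victim.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)

x = torch.randint(0, VOCAB, (8, 16))        # dummy batch of token ids
y = torch.randint(0, CLASSES, (8,))

for step in range(3):
    # 1) Generator samples discrete adversarial tokens and is rewarded
    #    for raising the victim's loss, i.e. for fooling the victim.
    dist = torch.distributions.Categorical(logits=gen(x))
    x_adv = dist.sample()                   # discrete tokens, not embeddings
    with torch.no_grad():
        reward = F.cross_entropy(victim(x_adv), y, reduction="none")
    log_p = dist.log_prob(x_adv).sum(dim=1)
    loss_g = -(reward * log_p).mean()       # REINFORCE-style objective
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # 2) Victim is fine-tuned on original *and* adversarial text.
    loss_v = F.cross_entropy(victim(x), y) + F.cross_entropy(victim(x_adv), y)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
```

In the MulGAT variant described above, the same loop would be preceded by fine-tuning the generator on a multi-task adversarial dataset so its perturbations transfer to unseen tasks; that pretraining stage is omitted here.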
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: adversarial training, adversarial defense
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 1017