Efficient Generative Adversarial Training for Language Models via Multi-task Feature Transfer

ACL ARR 2024 December Submission1017 Authors

15 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Adversarial training is a well-known methodology for enhancing language models and avoiding harmful responses and misclassification. Although adversarial training has seen empirical success, many existing methods craft query-based adversarial samples in embedding space, which differ from the adversarial features of realistic text encountered during training. In this work, we propose UnGAT and MulGAT, two new approaches to adversarial training. They produce perturbations as discrete tokens rather than applying perturbations to embedding representations throughout training. Both UnGAT and MulGAT consist of a generator that produces adversarial text and a victim model fine-tuned on both original and adversarial text. While UnGAT's generator is fine-tuned to fool the victim model without an adversarial dataset, MulGAT transfers adversarial features from source tasks to unseen tasks via a generator fine-tuned on a multi-task adversarial dataset. Experiments on text classification and dialogue generation demonstrate the effectiveness of our approaches over many state-of-the-art baselines.
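The abstract's key mechanism can be illustrated with a minimal sketch. Everything below is a toy assumption of mine, not the authors' code: the "generator" step crafts adversarial text as discrete token swaps (synonym substitutions that maximise the victim's loss), and the victim is a tiny bag-of-words logistic classifier updated on both the clean and the adversarial example. The vocabulary, synonym table, and function names are all hypothetical.

```python
# Hypothetical sketch of a discrete-token adversarial training loop in the
# spirit of UnGAT (all names and data here are illustrative assumptions).
import math

# Toy synonym table the "generator" may draw substitutions from.
SYNONYMS = {"good": ["fine", "great"], "bad": ["awful"]}

def victim_score(weights, tokens):
    # Toy victim model: logistic score over a bag-of-words weight sum.
    z = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))

def generate_adversarial(weights, tokens, label):
    # Generator step: greedily swap one token for the synonym that moves
    # the victim's score furthest from the true label. The perturbation
    # stays in discrete token space, not embedding space.
    best, best_loss = list(tokens), -1.0
    for i, t in enumerate(tokens):
        for s in SYNONYMS.get(t, []):
            cand = tokens[:i] + [s] + tokens[i + 1:]
            loss = abs(label - victim_score(weights, cand))
            if loss > best_loss:
                best, best_loss = cand, loss
    return best

def victim_update(weights, tokens, label, lr=0.5):
    # One SGD step of logistic regression on a single example.
    err = victim_score(weights, tokens) - label
    for t in tokens:
        weights[t] = weights.get(t, 0.0) - lr * err
    return weights

def train(data, rounds=20):
    weights = {}
    for _ in range(rounds):
        for tokens, label in data:
            adv = generate_adversarial(weights, tokens, label)
            # Victim is fine-tuned on both original and adversarial text.
            weights = victim_update(weights, tokens, label)
            weights = victim_update(weights, adv, label)
    return weights

data = [(["good"], 1), (["bad"], 0)]
w = train(data)
print(victim_score(w, ["fine"]) > 0.5)  # → True: the unseen synonym is handled
```

Training on the generator's discrete substitutions is what makes the victim robust to the synonym "fine" even though it never appears in the clean data; an embedding-space perturbation would not expose the model to that actual token.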
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: adversarial training, adversarial defense
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 1017
