Efficient Generative Adversarial Training for Language Models via Multi-task Feature Transfer

ACL ARR 2024 December Submission1017 Authors

15 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Adversarial training is a well-known methodology for enhancing language models and avoiding harmful responses and misclassification. Although adversarial training has seen empirical success, many existing methods craft query-based adversarial samples in embedding space, which differ from the adversarial features of realistic text encountered during training. In this work, we propose UnGAT and MulGAT, two new approaches to adversarial training. They produce perturbations as discrete tokens rather than applying perturbations to embedding representations throughout training. Both UnGAT and MulGAT consist of a generator that produces adversarial text and a victim model fine-tuned on both original and adversarial text. While UnGAT's generator is fine-tuned to fool the victim model without an adversarial dataset, MulGAT transfers adversarial features from source tasks to unseen tasks via a generator fine-tuned on a multi-task adversarial dataset. Experiments on text classification and dialogue generation demonstrate the effectiveness of our approaches over many state-of-the-art baselines.
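The abstract's key mechanism can be illustrated with a minimal sketch. Everything below is a toy assumption of mine, not the authors' code: the "generator" step crafts adversarial text as discrete token swaps (synonym substitutions that maximise the victim's loss), and the victim is a tiny bag-of-words logistic classifier updated on both the clean and the adversarial example. The vocabulary, synonym table, and function names are all hypothetical.

```python
# Hypothetical sketch of a discrete-token adversarial training loop in the
# spirit of UnGAT (all names and data here are illustrative assumptions).
import math

# Toy synonym table the "generator" may draw substitutions from.
SYNONYMS = {"good": ["fine", "great"], "bad": ["awful"]}

def victim_score(weights, tokens):
    # Toy victim model: logistic score over a bag-of-words weight sum.
    z = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))

def generate_adversarial(weights, tokens, label):
    # Generator step: greedily swap one token for the synonym that moves
    # the victim's score furthest from the true label. The perturbation
    # stays in discrete token space, not embedding space.
    best, best_loss = list(tokens), -1.0
    for i, t in enumerate(tokens):
        for s in SYNONYMS.get(t, []):
            cand = tokens[:i] + [s] + tokens[i + 1:]
            loss = abs(label - victim_score(weights, cand))
            if loss > best_loss:
                best, best_loss = cand, loss
    return best

def victim_update(weights, tokens, label, lr=0.5):
    # One SGD step of logistic regression on a single example.
    err = victim_score(weights, tokens) - label
    for t in tokens:
        weights[t] = weights.get(t, 0.0) - lr * err
    return weights

def train(data, rounds=20):
    weights = {}
    for _ in range(rounds):
        for tokens, label in data:
            adv = generate_adversarial(weights, tokens, label)
            # Victim is fine-tuned on both original and adversarial text.
            weights = victim_update(weights, tokens, label)
            weights = victim_update(weights, adv, label)
    return weights

data = [(["good"], 1), (["bad"], 0)]
w = train(data)
print(victim_score(w, ["fine"]) > 0.5)  # → True: the unseen synonym is handled
```

Training on the generator's discrete substitutions is what makes the victim robust to the synonym "fine" even though it never appears in the clean data; an embedding-space perturbation would not expose the model to that actual token.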
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: adversarial training, adversarial defense
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 1017
