DiffuseDef: Improved Robustness to Adversarial Attacks

ACL ARR 2024 June Submission1780 Authors

15 Jun 2024 (modified: 21 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as the models can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively, and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.
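The inference pipeline the abstract describes (noise the hidden state, denoise iteratively, ensemble) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the denoiser here is a hypothetical stand-in for the trained diffusion layer, and all function names and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(h, t):
    # Hypothetical denoiser: in DiffuseDef this would be the trained
    # diffusion layer predicting and removing noise at step t; here a
    # stand-in that shrinks the state toward its mean.
    return h - 0.1 * (h - h.mean())

def diffusedef_inference(hidden, n_steps=5, n_ensemble=3, noise_scale=0.1):
    """Sketch of the described pipeline: combine the (possibly
    adversarial) hidden state with sampled noise, denoise it
    iteratively, then ensemble the results into one representation."""
    denoised = []
    for _ in range(n_ensemble):
        # 1) combine the hidden state with sampled Gaussian noise
        h = hidden + noise_scale * rng.standard_normal(hidden.shape)
        # 2) denoise iteratively over the diffusion steps
        for t in reversed(range(n_steps)):
            h = denoise_step(h, t)
        denoised.append(h)
    # 3) ensemble the denoised states into a robust representation
    return np.mean(denoised, axis=0)

h_adv = rng.standard_normal(8)        # stand-in encoder hidden state
robust = diffusedef_inference(h_adv)  # representation fed to the classifier
```

The ensemble average over independently noised-and-denoised copies is what the abstract refers to as producing a robust text representation.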
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks, adversarial defense, diffusion, adversarial training
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1780