Abstract: Fine-tuning pre-trained language models (PLMs) has demonstrated remarkable performance on downstream tasks. These models, however, are vulnerable to adversarial attacks. Defenses based on adversarial fine-tuning, i.e., fine-tuning PLMs on adversarial examples, have been proposed to counter this vulnerability. Such defenses, however, suffer from unsatisfactory performance due to catastrophic forgetting: they fail to retain the robust features learned during pre-training. In this paper, we propose a novel parameter-efficient adversarial fine-tuning method that tunes only a small subset of the model's parameters, leaving the majority intact. Our method trains a defense soft prompt that is prepended to inputs and steers the PLM toward robust predictions. Extensive experiments demonstrate the effectiveness of the proposed defense across various benchmarks and PLMs.
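To make the described setup concrete, here is a minimal sketch of training a defense soft prompt on adversarial examples while the PLM stays frozen. It assumes a PyTorch/HuggingFace environment; the model name, prompt length, learning rate, and the `make_adversarial` attack hook are all hypothetical placeholders, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical choices; the abstract does not specify a backbone or prompt length.
MODEL_NAME = "bert-base-uncased"
PROMPT_LEN = 20

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Parameter-efficient: freeze every PLM weight; only the soft prompt is trained.
for p in model.parameters():
    p.requires_grad = False

embed = model.get_input_embeddings()
# Trainable defense soft prompt in embedding space.
soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, embed.embedding_dim) * 0.02)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def forward_with_prompt(input_ids, attention_mask, labels):
    """Prepend the soft prompt to the input embeddings and run the frozen PLM."""
    tok_embeds = embed(input_ids)                           # (B, T, H)
    B = tok_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(B, -1, -1)     # (B, P, H)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)  # (B, P+T, H)
    prompt_mask = torch.ones(B, PROMPT_LEN, dtype=attention_mask.dtype,
                             device=attention_mask.device)
    mask = torch.cat([prompt_mask, attention_mask], dim=1)
    out = model(inputs_embeds=inputs_embeds, attention_mask=mask)
    return loss_fn(out.logits, labels)

def train_step(batch, make_adversarial):
    # `make_adversarial` stands in for an external attack (e.g., word
    # substitution) that produces adversarial versions of the inputs.
    adv_texts = make_adversarial(batch["text"], batch["label"])
    enc = tokenizer(adv_texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(batch["label"])
    loss = forward_with_prompt(enc["input_ids"], enc["attention_mask"], labels)
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into soft_prompt
    optimizer.step()
    return loss.item()
```

Because only `soft_prompt` receives gradients, the frozen backbone retains whatever robust features it acquired during pre-training, which is the motivation the abstract gives for avoiding full adversarial fine-tuning.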
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Robust Fine-tuning; Adversarial Training; Robustness; Pre-trained Language Models; Soft Prompt
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4847