Abstract: Fine-tuning pre-trained language models (PLMs) has demonstrated remarkable performance on downstream tasks. These models, however, are vulnerable to adversarial attacks. Defenses based on adversarial fine-tuning, i.e., fine-tuning PLMs on adversarial examples, have been proposed to counter this vulnerability. Such defenses, however, suffer from unsatisfactory performance due to catastrophic forgetting: they fail to retain the robust features learned during pre-training. In this paper, we propose a novel parameter-efficient adversarial fine-tuning method that tunes only a small subset of the model's parameters, leaving the majority intact. Our method trains a defense soft prompt that is prepended to inputs and steers the PLM toward robust predictions. Extensive experiments demonstrate the effectiveness of the proposed defense across various benchmarks and PLMs.
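To make the described setup concrete, here is a minimal sketch of training a defense soft prompt on adversarial examples while the PLM stays frozen. It assumes a PyTorch/HuggingFace environment; the model name, prompt length, learning rate, and the `make_adversarial` attack hook are all hypothetical placeholders, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical choices; the abstract does not specify a backbone or prompt length.
MODEL_NAME = "bert-base-uncased"
PROMPT_LEN = 20

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Parameter-efficient: freeze every PLM weight; only the soft prompt is trained.
for p in model.parameters():
    p.requires_grad = False

embed = model.get_input_embeddings()
# Trainable defense soft prompt in embedding space.
soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, embed.embedding_dim) * 0.02)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def forward_with_prompt(input_ids, attention_mask, labels):
    """Prepend the soft prompt to the input embeddings and run the frozen PLM."""
    tok_embeds = embed(input_ids)                           # (B, T, H)
    B = tok_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(B, -1, -1)     # (B, P, H)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)  # (B, P+T, H)
    prompt_mask = torch.ones(B, PROMPT_LEN, dtype=attention_mask.dtype,
                             device=attention_mask.device)
    mask = torch.cat([prompt_mask, attention_mask], dim=1)
    out = model(inputs_embeds=inputs_embeds, attention_mask=mask)
    return loss_fn(out.logits, labels)

def train_step(batch, make_adversarial):
    # `make_adversarial` stands in for an external attack (e.g., word
    # substitution) that produces adversarial versions of the inputs.
    adv_texts = make_adversarial(batch["text"], batch["label"])
    enc = tokenizer(adv_texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(batch["label"])
    loss = forward_with_prompt(enc["input_ids"], enc["attention_mask"], labels)
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into soft_prompt
    optimizer.step()
    return loss.item()
```

Because only `soft_prompt` receives gradients, the frozen backbone retains whatever robust features it acquired during pre-training, which is the motivation the abstract gives for avoiding full adversarial fine-tuning.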
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Robust Fine-tuning; Adversarial Training; Robustness; Pre-trained Language Models; Soft Prompt
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4847