Keywords: Adversarial robustness, parameter-efficient fine-tuning
Abstract: Pre-trained foundation models (PTMs) that undergo standard pre-training can be efficiently fine-tuned for downstream tasks using parameter-efficient fine-tuning (PEFT) methods. However, these models remain highly vulnerable to adversarial perturbations. Existing studies often distribute PEFT parameters uniformly across layers, overlooking the varying importance of individual layers. In this work, we systematically analyze the adversarial robustness of PEFT strategies and introduce a novel vulnerability score, a computationally efficient gradient-based measure that identifies which layers and components are most susceptible to adversarial attacks. Guided by this score, we design robustness-aware PEFT methods: LoRA High, which concentrates parameters in the most vulnerable layers, and LoRA+Adapter, which assigns LoRA to the attention component and adapters to the feed-forward component. Extensive adversarial-training experiments on four real-world image classification datasets show that these targeted PEFT designs consistently outperform vanilla PEFT methods. Post-adversarial-fine-tuning analysis with a pruning-style attribution score confirms that strategically protecting the most vulnerable parts of the backbone is key to robustness in PEFT.
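The abstract describes the vulnerability score only as a "computationally efficient gradient-based measure" without giving its formula. As a minimal illustrative sketch (not the paper's actual method), one plausible per-layer gradient-based proxy is the RMS gradient magnitude of the task loss over each top-level block's parameters; the function name, aggregation choice, and "LoRA High" usage below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def layer_vulnerability_scores(model, inputs, labels):
    """Illustrative proxy score, assuming the backbone exposes its blocks as
    named children (e.g. ViT transformer blocks) and returns class logits.
    NOT the paper's exact score: it ranks layers by RMS gradient magnitude
    of the clean cross-entropy loss, as one simple gradient-based measure."""
    model.zero_grad()
    loss = F.cross_entropy(model(inputs), labels)
    loss.backward()

    scores = {}
    for name, module in model.named_children():
        grad_sq, n_params = 0.0, 0
        for p in module.parameters():
            if p.grad is not None:
                grad_sq += p.grad.detach().pow(2).sum().item()
                n_params += p.numel()
        if n_params > 0:
            scores[name] = (grad_sq / n_params) ** 0.5  # RMS gradient per block
    return scores  # larger value ~ more "vulnerable" block

# Hypothetical usage in the spirit of "LoRA High": attach LoRA modules only
# to the top-k highest-scoring blocks and keep the remaining blocks frozen.
# scores = layer_vulnerability_scores(backbone, x_batch, y_batch)
# top_k_blocks = sorted(scores, key=scores.get, reverse=True)[:4]
```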
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20222