Adapting to Evolving Adversaries with Regularized Continual Robust Training

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: Regularization helps when sequentially adapting to new test-time attacks
Abstract: Robust training methods typically defend against specific attack types, such as $\ell_p$ attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended model to new adversaries as they arise via fine-tuning, a method which we call continual robust training (CRT). However, when implemented naively, fine-tuning on new attacks degrades robustness on previous attacks. This raises the question: \textit{how can we improve the initial training and fine-tuning of the model to simultaneously achieve robustness against previous and new attacks?} We present theoretical results which show that the gap in a model's robustness against different attacks is bounded by how far each attack perturbs a sample in the model's logit space, suggesting that regularizing with respect to this logit space distance can help maintain robustness against previous attacks. Extensive experiments on 3 datasets (CIFAR-10, CIFAR-100, and ImageNette) and over 100 attack combinations demonstrate that the proposed regularization improves robust accuracy with little overhead in training time. Our findings and open-source code lay the groundwork for the deployment of models robust to evolving attacks.
Lay Summary: Adversarial examples are a phenomenon where neural networks are fooled by small, imperceptible perturbations. Many existing techniques for training neural networks to be robust against adversarial examples focus on a specific type of perturbation (i.e., those that lie within a specific bounded distance from the original image). These methods typically include examples of perturbed inputs during training to teach the model the kind of mistakes to avoid. However, the space of imperceptible perturbations to which models are vulnerable is large, and over time new perturbations (i.e., small transformations to the inputs) may be discovered that were not considered when the model was robustly trained (researchers have come up with many clever transformations, including changing the color of pixels slightly!). In this work, we propose repeatedly training the model against new attacks, using the previous iteration of the model as a starting point. We also add a term during training designed to prevent the model's outputs for different perturbation types from drifting too far apart. This helps with robustness to new perturbations while not 'forgetting' robustness against previous types. Our approach takes steps towards robust models that can be easily adapted to new attacks, which is important for applications in which robustness is critical and model retraining is expensive.
Link To Code: https://github.com/inspire-group/continual_robust_training/
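The linked repository contains the authors' implementation. For illustration only, below is a minimal PyTorch sketch of one fine-tuning step in the spirit of the abstract: adversarial training on a newly encountered attack plus a penalty on how far that attack moves the sample in logit space. The helper `generate_attack`, the pairing of clean vs. adversarial logits, and the ℓ2 form of the penalty are assumptions for this sketch; the exact regularizer and training loop used in the paper may differ.

```python
# Illustrative sketch only (not the authors' released code): one fine-tuning step of
# continual robust training with a logit-space regularizer.
# `generate_attack` is a hypothetical stand-in for any adversarial example generator
# (e.g., a PGD routine for the newly encountered attack).
import torch
import torch.nn.functional as F

def crt_finetune_step(model, x, y, generate_attack, optimizer, reg_weight=1.0):
    model.train()
    x_adv = generate_attack(model, x, y)   # perturbed inputs for the new attack

    logits_adv = model(x_adv)              # logits under the new attack
    logits_clean = model(x)                # logits on the unperturbed inputs

    # Standard adversarial training loss on the new attack.
    adv_loss = F.cross_entropy(logits_adv, y)

    # Regularizer: penalize how far the attack moves the sample in logit space,
    # which (per the paper's bound) is intended to help preserve robustness
    # against previously seen attacks. The l2 form here is an assumption.
    logit_dist = (logits_adv - logits_clean).norm(p=2, dim=1).mean()

    loss = adv_loss + reg_weight * logit_dist
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```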
Primary Area: Deep Learning->Robustness
Keywords: Adversarial training, multi-attack, fine-tuning, regularization
Submission Number: 11689