Adaptive Adversarial Training for Balancing Model Robustness and Standard Performance

ACL ARR 2025 February Submission 6508 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Adversarial training (AT) is widely used to improve model robustness against adversarial attacks, i.e., minor perturbations added to clean inputs to fool the target model. However, AT can also degrade clean accuracy because it shifts the distribution of the training set. Using a Taylor expansion, we find that commonly used adversarial loss functions inherently contain the clean loss, which makes it difficult for previous methods to balance accuracy and robustness effectively. Building on this observation, we establish a flexible AT framework that explicitly balances model robustness and clean accuracy by assigning learnable weights to the decomposed adversarial loss. Comprehensive experimental results show that our method improves model robustness while maintaining comparable standard performance.
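The decomposition the abstract alludes to can be sketched as follows; this is a minimal illustration under assumed notation ($\mathcal{L}$ for the loss, $x$ the clean input, $\delta$ the perturbation, $\theta$ the model parameters, $w_1, w_2$ the learnable weights), not necessarily the paper's exact formulation. A first-order Taylor expansion of the adversarial loss around the clean input,

$$\mathcal{L}(x+\delta;\theta) \approx \mathcal{L}(x;\theta) + \nabla_x \mathcal{L}(x;\theta)^{\top}\delta,$$

shows that the adversarial objective implicitly contains the clean loss $\mathcal{L}(x;\theta)$ plus a perturbation-dependent robustness term, so minimizing the adversarial loss alone fixes their relative weighting. A reweighted objective of the form

$$\mathcal{L}_{\mathrm{total}} = w_1\,\mathcal{L}(x;\theta) + w_2\,\bigl(\mathcal{L}(x+\delta;\theta) - \mathcal{L}(x;\theta)\bigr),$$

with learnable $w_1$ and $w_2$, would make the accuracy-robustness trade-off explicit, which is the kind of flexibility the proposed framework claims.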
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: adversarial training
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 6508