Keywords: NLP, language model, robustness, classification
Abstract: Fine-tuning pre-trained language models (LMs) has become the de-facto standard for improving state-of-the-art performance on various NLP tasks. However, although these models are usually evaluated by accuracy on fixed validation sets, such evaluation is insufficient for the reliable deployment of fine-tuned LMs in real-world settings, as it overlooks known issues such as adversarial robustness and model calibration. To address these issues, we propose a simple yet effective training algorithm, coined Robustifying LMs via Adversarial training with Masked gradient (RAM), to improve the robustness of fine-tuned LMs. In particular, we leverage adversarial training to robustify LMs against various types of perturbations. Simultaneously, to prevent the trained model from deviating far from the initial pre-trained model, we selectively update the important model parameters using masked gradients; the relative importance of each parameter is obtained from the gradients computed during training. Consequently, the model preserves the generalizability of the pre-trained model while improving its robustness. Additionally, we construct a new benchmark that evaluates the robustness of fine-tuned LMs along four representative aspects of model robustness in a unified way. On this benchmark, we demonstrate the effectiveness of RAM compared to other state-of-the-art fine-tuning methods, and verify that RAM successfully robustifies various types of LMs. Our work suggests rethinking the robustness of LMs as an essential direction for their reliable deployment, and offers a simple yet effective solution.
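To make the two components of RAM described in the abstract concrete, below is a minimal PyTorch sketch of one training step: a single-step, FGSM-style adversary on the input embeddings, followed by a top-k gradient-magnitude mask so that only the "important" parameters are updated. This is an illustrative assumption of how such a step could look, not the authors' implementation; the hyperparameters (`epsilon`, `mask_ratio`), the choice of FGSM-style perturbation, and the toy model are all hypothetical.

```python
# Hedged sketch of a RAM-style training step: adversarial training with
# masked gradients. All names and hyperparameters here are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ram_step(model, optimizer, embeddings, labels, epsilon=1e-2, mask_ratio=0.3):
    """One adversarial-training step with gradient masking.

    1. Perturb the input embeddings in the direction of the loss gradient
       (a single-step, FGSM-style adversary on the embedding space).
    2. Back-propagate the loss on the perturbed inputs.
    3. Keep only the top `mask_ratio` fraction of each parameter's gradient
       entries (by magnitude), so only the "important" parameters move,
       limiting drift from the pre-trained initialization.
    """
    # --- 1. craft an adversarial perturbation on the embeddings ---
    embeddings = embeddings.detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeddings), labels)
    grad_x, = torch.autograd.grad(loss, embeddings)
    adv_embeddings = embeddings + epsilon * grad_x.sign()

    # --- 2. compute the training loss on the perturbed inputs ---
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(adv_embeddings.detach()), labels)
    adv_loss.backward()

    # --- 3. mask gradients: zero all but the largest-magnitude entries ---
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(mask_ratio * g.numel()))
        threshold = torch.topk(g, k).values.min()
        p.grad.mul_((p.grad.abs() >= threshold).float())

    optimizer.step()
    return adv_loss.item()

# Toy usage: a linear "classification head" over random embeddings.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
print(ram_step(model, optimizer, x, y))
```

One design note on the sketch: masking by per-parameter gradient magnitude is just one plausible reading of "relative importance ... obtained from the gradients calculated during training"; importance could equally be accumulated over multiple steps rather than taken from a single backward pass.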
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (eg, AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)