Gradient Sparsification For \emph{Masked Fine-Tuning} of Transformers

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Fine-tuning masked language models is widely adopted for transfer learning to downstream tasks and can be achieved by (1) freezing the gradients of the pretrained network and updating only the gradients of a newly added classification layer, or (2) performing gradient updates on all parameters. Gradual unfreezing trades off between the two by progressively unfreezing the gradients of whole layers during training. We propose to extend this to {\em stochastic gradient masking} to regularize pretrained language models for improved fine-tuning performance. We introduce \emph{GradDrop} and variants thereof, a class of gradient sparsification methods that mask gradients prior to gradient descent. Unlike gradual unfreezing, which is non-sparse and deterministic, GradDrop is sparse and stochastic. Experiments on the multilingual XGLUE benchmark with XLM-R$_{\text{Large}}$ show that \emph{GradDrop} outperforms standard fine-tuning and gradual unfreezing, while being competitive against methods that use additional translated data and intermediate pretraining. Lastly, we identify cases where the largest zero-shot performance gains are on lower-resourced languages.
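A minimal sketch of the stochastic gradient masking idea described in the abstract, assuming a PyTorch-style fine-tuning loop; the drop probability p_drop, the per-element Bernoulli masks, the helper apply_gradient_mask, and the toy model are illustrative assumptions rather than the authors' exact GradDrop procedure:

import torch
import torch.nn as nn

def apply_gradient_mask(model: nn.Module, p_drop: float = 0.5) -> None:
    # Zero each gradient element independently with probability p_drop,
    # a sparse, stochastic alternative to freezing whole layers.
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                keep = torch.bernoulli(torch.full_like(param.grad, 1.0 - p_drop))
                param.grad.mul_(keep)

# Toy usage; in practice the model would be a pretrained masked language
# model such as XLM-R with a task-specific classification head.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
apply_gradient_mask(model, p_drop=0.5)  # mask gradients before the update
optimizer.step()
optimizer.zero_grad()

Because the masks are resampled at every step, each update touches a different random subset of parameters, which is what distinguishes this sketch from deterministic, layer-wise gradual unfreezing.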