On the Computational Efficiency of Adapting Transformer Models via Adversarial Noise

16 May 2022 (modified: 05 May 2023) · NeurIPS 2022 Submitted · Readers: Everyone
Keywords: Efficient Training Methods, Pre-trained Transformer Networks, Distributed Training
Abstract: Pretraining Transformer-based language models and then adapting the pre-trained models to a downstream task is an effective transfer mechanism in NLP. While the pretraining stage is well known to be computationally expensive, the adaptation stage is also becoming time-consuming for many downstream tasks as Transformers grow rapidly in size. Prior work focuses on reducing pretraining wall-clock time by increasing the batch size to obtain higher training throughput on multiple processors, but few studies have explored how such a scheme may benefit the adaptation phase. Meanwhile, adversarial training has shown improved generalization when adapting Transformer models on many NLP tasks, yet it is often treated as a separate line of research, and its effectiveness in the large-batch regime is not well understood. In this paper, we show that adversarial training attains promising model accuracy even with a considerably larger batch size. However, the computational complexity of this approach, due to the high cost of generating adversaries, prevents it from reducing adaptation costs as the number of processors increases. We therefore systematically study adversarial large-batch optimization for adapting Transformers from a computational-complexity perspective. Our investigation yields efficient and practical algorithms for adapting Transformer models. Experiments show that our proposed method attains up to 9.8$\times$ adaptation speedups over the baseline on BERT$_{base}$ and RoBERTa$_{large}$, while achieving accuracy comparable to, and sometimes higher than, state-of-the-art large-batch optimization methods.
TL;DR: Our detailed analysis of the computational efficiency of adversarial large-batch optimization leads to a simple yet practical method that accelerates model adaptation of Transformers by up to 9.8 times.
Supplementary Material: pdf
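
The adaptation cost discussed in the abstract stems from how adversaries are generated: PGD-style adversarial training perturbs the input embeddings with several inner gradient steps per mini-batch, and each inner step requires an extra forward and backward pass. The sketch below illustrates this generic scheme in PyTorch; it is not the paper's proposed algorithm, and `embed_layer`, `model`, `loss_fn`, and all hyperparameters are illustrative placeholders.

```python
# A minimal PGD-style adversarial fine-tuning step on input embeddings
# (a generic sketch, not the paper's proposed method). Every name and
# hyperparameter here is illustrative.
import torch


def adversarial_finetune_step(embed_layer, model, loss_fn, input_ids, labels,
                              adv_steps=3, adv_lr=0.1, eps=1e-2):
    """One mini-batch update with adversarial perturbations in embedding space.

    Each of the `adv_steps` inner iterations adds an extra forward and backward
    pass, which is the overhead that makes adversary generation expensive.
    """
    embeds = embed_layer(input_ids)                       # clean token embeddings
    delta = torch.zeros_like(embeds, requires_grad=True)  # adversarial perturbation

    for _ in range(adv_steps):
        # extra forward + backward pass just to craft the adversary
        adv_loss = loss_fn(model(embeds.detach() + delta), labels)
        grad, = torch.autograd.grad(adv_loss, delta)
        # ascend the loss (L-inf PGD step), then project back into the eps-ball
        delta = (delta + adv_lr * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)

    # the gradient from the final adversarial example drives the parameter update
    final_loss = loss_fn(model(embeds + delta.detach()), labels)
    final_loss.backward()
    return final_loss
```

With K adversarial steps, each training step costs roughly K+1 forward/backward passes instead of one, so the per-step overhead does not shrink as more processors are added; this is the computational bottleneck the abstract refers to.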