Efficient Large-scale Transformer Training via Random and Layerwise Token Dropping

Published: 01 Feb 2023, Last Modified: 13 Feb 2023 | Submitted to ICLR 2023 | Readers: Everyone
Keywords: Efficient Training, Large-scale Transformers, Token Dropping, GPT, BERT, ViT
TL;DR: We present a novel random and layerwise token dropping method that saves up to 33.3% of the theoretical compute cost and 25.6% of the wall-clock training time while achieving accuracy comparable to the standard training procedure.
Abstract: Large-scale transformer models have become the de facto architectures for various machine learning applications, e.g., CV and NLP. However, these large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. In particular, random-LTD achieves considerable speedups while maintaining accuracy comparable to the standard training baseline. Compared to other token dropping methods, random-LTD requires neither (1) any importance-score-based metric, (2) any special token treatment (e.g., [CLS]), nor (3) full-sequence-length training for most layers: only the first and last layers operate on the full sequence. In addition, a new LayerToken learning rate schedule is proposed for pretraining problems, which resolves the heavy tuning requirement of our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to a broad range of applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% of the theoretical compute cost and 25.6% of the wall-clock training time while achieving similar zero-shot evaluation results on GPT as compared to the baseline.
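
For intuition, here is a minimal, hedged sketch of the general idea described in the abstract: middle layers process only a random subset of token positions, while the remaining tokens bypass those layers unchanged and are rejoined afterward, and only the first and last layers see the full sequence. This is an illustrative approximation, not the authors' implementation; names such as `drop_tokens_layer`, `ToyEncoder`, and `keep_ratio` are hypothetical.

```python
# Illustrative sketch of random, layerwise token dropping (assumed design, not
# the paper's code): each middle layer runs on a random subset of tokens and
# the dropped tokens are copied through unchanged.
import torch
import torch.nn as nn


def drop_tokens_layer(layer: nn.Module, hidden: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Run `layer` on a random subset of token positions; bypassed tokens pass through."""
    batch, seq_len, _ = hidden.shape
    num_keep = max(1, int(seq_len * keep_ratio))
    # Sample (without replacement) which token positions this layer processes.
    keep_idx = torch.randperm(seq_len, device=hidden.device)[:num_keep]
    subset = hidden[:, keep_idx, :]        # gather the kept tokens
    processed = layer(subset)              # the layer sees a shorter sequence
    output = hidden.clone()                # dropped tokens bypass the layer
    output[:, keep_idx, :] = processed     # scatter processed tokens back
    return output


class ToyEncoder(nn.Module):
    """Toy stack: first and last layers see the full sequence; middle layers drop tokens."""

    def __init__(self, dim: int = 64, depth: int = 6, keep_ratio: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(depth)]
        )
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if 0 < i < len(self.layers) - 1:   # middle layers: random token subset
                x = drop_tokens_layer(layer, x, self.keep_ratio)
            else:                              # first/last layers: full sequence
                x = layer(x)
        return x


if __name__ == "__main__":
    model = ToyEncoder()
    tokens = torch.randn(2, 128, 64)  # (batch, sequence, hidden)
    print(model(tokens).shape)        # torch.Size([2, 128, 64])
```

In this rough cost model, compute scales with the total number of layer-tokens processed, so shrinking the per-layer sequence length in the middle layers is what yields the theoretical savings; the actual keep-ratio schedule and the LayerToken learning rate schedule are described in the paper.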
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip
