Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: BERT, Training speedup, Multi-stage training, Natural language processing
Abstract: Pre-trained language models, such as BERT, have achieved significant accuracy gains in many natural language processing tasks. Despite its effectiveness, the huge number of parameters makes training a BERT model computationally very challenging. In this paper, we propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT. We decompose the whole training process into several stages. Training starts from a small model with only a few encoder layers, and we gradually increase the depth of the model by adding new encoder layers. At each stage, we train only the newly added top few encoder layers (those near the output layer). The parameters of the other layers, which were trained in previous stages, are not updated in the current stage. In BERT training, the backward calculation is much more time-consuming than the forward calculation, especially in the distributed training setting, where the backward calculation time also includes the communication time for gradient synchronization. In the proposed training strategy, only the top few layers participate in the backward calculation, while most layers participate only in the forward calculation. Hence both computation and communication efficiency are greatly improved. Experimental results show that the proposed method can greatly reduce the training time without significant performance degradation.
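The sketch below illustrates the staged, layerwise scheme described in the abstract, in PyTorch. It is a minimal, hypothetical example under assumed settings; the layer counts, stage schedule, loss, and helper names (GrowingEncoder, train_stage) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of multi-stage layerwise training:
# grow the encoder stack stage by stage and train only the newly added top
# layers, so frozen lower layers need only a forward pass.
import torch
import torch.nn as nn

class GrowingEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.layers = nn.ModuleList()
        self.d_model, self.n_heads = d_model, n_heads

    def add_layers(self, n_new):
        # Append n_new encoder layers near the output; only these are trained next.
        for _ in range(n_new):
            self.layers.append(
                nn.TransformerEncoderLayer(self.d_model, self.n_heads, batch_first=True)
            )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def train_stage(model, new_layer_count, data_loader, steps):
    # Freeze previously trained layers: they still run in the forward pass
    # but receive no gradient updates and require no gradient synchronization.
    for layer in model.layers[:-new_layer_count]:
        for p in layer.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=1e-4)
    for _, (x, target) in zip(range(steps), data_loader):
        loss = ((model(x) - target) ** 2).mean()  # placeholder loss for illustration
        opt.zero_grad()
        loss.backward()   # backward reaches only the unfrozen top layers
        opt.step()

# Illustrative stage schedule: grow from 3 to 12 layers, adding 3 layers per stage.
model = GrowingEncoder()
for stage in range(4):
    model.add_layers(3)
    # train_stage(model, new_layer_count=3, data_loader=..., steps=...)
```

Because the frozen lower layers' parameters (and the inputs) do not require gradients, autograd stops the backward pass at the new top layers, which is what yields the computation and communication savings claimed in the abstract.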
One-sentence Summary: This paper proposes a multi-stage layerwise training method to accelerate the training of the BERT model.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=oE6YZEbyxx