Reducing BERT Computation by Padding Removal and Curriculum Learning

Wei Zhang, Wei Wei, Wen Wang, Lingling Jin, Zheng Cao

Published: 2021, Last Modified: 08 Oct 2025, Venue: ISPASS 2021, License: CC BY-SA 4.0

Abstract: BERT [1] is computationally expensive, which hinders both its training and its deployment. This work focuses on removing the unnecessary computation caused by input padding in BERT. The input of BERT consists of two concatenated sentences. If the concatenated length is shorter than the maximum sequence length, padding must be appended to fill the remaining slots in the input. Because sentence lengths vary greatly, the input can contain a large amount of padding. For the English Wikipedia & BooksCorpus dataset, padding accounts for 17% and 48% of all input tokens when the maximum sequence length is set to 128 and 512, respectively. For the Chinese Wikipedia dataset, the corresponding figures are 35% and 79%. For SQuAD-v1.1 [2], padding accounts for 54% of the total input tokens when the maximum sequence length is 384. A substantial share of computation is therefore wasted on padding, which significantly increases training and inference time.
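To make the padding percentages above concrete, here is a minimal sketch of how such a figure could be computed for a set of inputs. The function name `padding_fraction` and the example token counts are hypothetical and are not taken from the paper or its datasets. The sketch also assumes that sequences longer than the maximum length are truncated.

```python
def padding_fraction(lengths, max_seq_len):
    """Fraction of input slots filled with padding when every
    sequence is padded to max_seq_len (hypothetical helper)."""
    if not lengths:
        return 0.0
    # Assumption: sequences longer than max_seq_len are truncated,
    # so they contribute max_seq_len real tokens and no padding.
    real_tokens = sum(min(n, max_seq_len) for n in lengths)
    total_slots = len(lengths) * max_seq_len
    return (total_slots - real_tokens) / total_slots


# Illustrative token counts only, not measurements from any dataset.
example_lengths = [128, 64, 96, 32]
print(padding_fraction(example_lengths, 128))  # 0.375
print(padding_fraction(example_lengths, 512))  # 0.84375
```

With the same inputs, raising the maximum sequence length from 128 to 512 increases the padding share, which matches the trend the abstract reports for each dataset.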