Abstract: BERT and other pre-trained language models (PLMs) are ubiquitous in modern NLP. Although PLMs are the state-of-the-art (SOTA) models for almost every NLP task \citep{Qiu2020PretrainedMF}, their significant inference latency prevents wider industrial adoption. In this work, we propose \underline{P}atient and \underline{C}onfident \underline{E}arly \underline{E}xiting BERT (PCEE-BERT), an off-the-shelf, sample-dependent early exiting method that works with different PLMs and can be combined with popular model compression methods. With a multi-exit BERT as the backbone model, PCEE-BERT exits early once a sufficient number of consecutive intermediate layers (the patience parameter) are confident about their predictions. The entropy of an intermediate layer's prediction measures its confidence level. Experiments on the GLUE benchmark demonstrate that our method outperforms previous SOTA early exiting methods. Ablation studies show that: (a) our method performs consistently well on other PLMs, such as ALBERT and TinyBERT; (b) PCEE-BERT can achieve different speed-up ratios by adjusting the patience parameter and the confidence threshold.
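The exit rule described in the abstract can be illustrated with a minimal sketch. It assumes that each intermediate layer's classifier has already produced a probability distribution over the labels, and that a layer counts as "confident" when the entropy of that distribution falls below a threshold. The names `prediction_entropy`, `early_exit`, `patience`, and `threshold` are hypothetical and chosen only for illustration; this is not the authors' implementation, and the paper's exact criterion may differ in detail.

```python
import math
from typing import Sequence, Tuple


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one layer's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(
    layer_probs: Sequence[Sequence[float]],
    patience: int,
    threshold: float,
) -> Tuple[int, int]:
    """Return (exit_layer, predicted_label), with layers counted from 1.

    Assumption: a layer is "confident" when its entropy is below `threshold`.
    Exit at the first layer where `patience` consecutive layers were confident;
    otherwise fall through to the last layer.
    """
    if not layer_probs:
        raise ValueError("layer_probs must contain at least one layer")
    streak = 0
    for layer_idx, probs in enumerate(layer_probs, start=1):
        if prediction_entropy(probs) < threshold:
            streak += 1  # one more consecutive confident layer
        else:
            streak = 0  # confidence broken, start counting again
        if streak >= patience:
            label = max(range(len(probs)), key=lambda k: probs[k])
            return layer_idx, label
    last = layer_probs[-1]
    return len(layer_probs), max(range(len(last)), key=lambda k: last[k])


# Toy example: four exits of a binary classifier.
toy_layers = [[0.5, 0.5], [0.9, 0.1], [0.95, 0.05], [0.99, 0.01]]
exit_layer, label = early_exit(toy_layers, patience=2, threshold=0.4)
```

In this toy run the first layer is uncertain and the next two are confident, so the sketch exits at the third layer. Raising `patience` or lowering `threshold` makes exits later and rarer, which is the knob the abstract refers to when it mentions trading accuracy for different speed-up ratios.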
Paper Type: long