Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding

01 Feb 2024 · OpenReview Archive Direct Upload
Abstract: To tackle the high inference latency exhibited by autoregressive language models, previous studies have proposed an early-exiting framework that allocates adaptive computation paths for each token based on the complexity of generating the subsequent token. However, we observed several shortcomings, including performance degradation caused by a state copying mechanism or numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrate the superiority of our proposed framework on extensive generation tasks.
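The sketch below is a minimal, illustrative rendering of the confidence-thresholded early-exit loop described in the abstract, assuming a toy setup: `shallow_forward`, `deep_forward`, `CONF_THRESHOLD`, and the random stand-in distributions are hypothetical placeholders, not the paper's implementation, and the Beta-mixture adaptive threshold estimator is omitted. It only shows the general idea of accepting confident shallow predictions and deferring low-confidence tokens, together with previously stacked early-exited positions, to one parallel deep pass.

```python
import numpy as np

VOCAB_SIZE = 32          # toy vocabulary size (illustrative only)
CONF_THRESHOLD = 0.9     # fixed exit threshold; FREE estimates this adaptively
rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def shallow_forward(context):
    """Stand-in for the shallow (early-exit) decoder head."""
    return softmax(rng.normal(size=VOCAB_SIZE) * 3.0)

def deep_forward(contexts):
    """Stand-in for the full-depth decoder, applied to a batch of positions at once."""
    return [softmax(rng.normal(size=VOCAB_SIZE)) for _ in contexts]

def generate(prompt, max_new_tokens=20):
    tokens = list(prompt)
    pending = []  # positions decoded by the shallow model, awaiting a deep pass
    for _ in range(max_new_tokens):
        probs = shallow_forward(tokens)
        if probs.max() >= CONF_THRESHOLD:
            # Early exit: accept the shallow prediction and defer the deep computation.
            tokens.append(int(probs.argmax()))
            pending.append(len(tokens) - 1)
        else:
            # Low confidence: run the deep model on the current token together with
            # all stacked early-exited tokens in a single parallel (batched) pass.
            batch = [tokens[:p] for p in pending] + [tokens]
            deep_probs = deep_forward(batch)
            tokens.append(int(deep_probs[-1].argmax()))
            pending.clear()
    return tokens

print(generate([1, 2, 3]))
```

In this toy version the deep pass simply fills in the deferred positions and decodes the hard token; the point is that deferred work is batched rather than recomputed token by token, which is the source of the speedup the abstract describes.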