Keywords: early exiting, machine learning, adaptive computation, large language models
Abstract: Large Language Models (LLMs) achieve strong results across a wide range of tasks, but inference can be costly.
Early exiting methods offer a promising solution: they assume that not all tokens need the same amount of computation and let some tokens exit the LLM at earlier layers.
Several early exiting methods have been proposed, and they rely on the implicit assumption that the network becomes more confident in its prediction as it does more computation.
We investigate this assumption for two early exiting methods and, based on the resulting insights, propose three new confidence measures for early exiting.
We find early evidence that monotonicity benefits the quality of token generation.
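
To make the monotonicity assumption concrete, the following is a minimal sketch of confidence-based early exiting. It is an illustration only: the maximum softmax probability is used as a stand-in confidence measure (it is not one of the measures proposed in this work, which the abstract does not specify), and the names `early_exit_layer`, `is_monotonic`, and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob_confidence(logits):
    """Assumed confidence measure: the largest softmax probability."""
    return max(softmax(logits))


def early_exit_layer(per_layer_logits, threshold):
    """Return (layer index, predicted token id) for the first layer whose
    confidence reaches the threshold; fall back to the last layer."""
    for i, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return i, probs.index(conf)
    last = softmax(per_layer_logits[-1])
    return len(per_layer_logits) - 1, last.index(max(last))


def is_monotonic(confidences, tol=0.0):
    """True if confidence never drops by more than tol between layers."""
    return all(b >= a - tol for a, b in zip(confidences, confidences[1:]))


# Toy logits for one token at three successive layers (made-up values).
per_layer_logits = [
    [1.0, 0.5, 0.2],
    [2.0, 0.5, 0.1],
    [4.0, 0.2, 0.1],
]
confidences = [max_prob_confidence(l) for l in per_layer_logits]
exit_layer, token = early_exit_layer(per_layer_logits, threshold=0.7)
```

In this toy case the confidence rises at every layer, so exiting at the first layer that clears the threshold is consistent with the assumption. If confidence instead dipped at a later layer, a threshold rule could exit on a prediction the deeper layers would have revised, which is the situation the monotonicity question is about.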
Submission Number: 26