Keywords: Early Exit; Retrieval Augmentation; Large Language Model
Abstract: Deploying large language model inference remains challenging due to their high computational overhead.
Early exit optimizes model inference by adaptively reducing the number of inference layers.
Current methods typically train internal classifiers to determine whether to exit at intermediate layers.
However, such classifier-based early exit frameworks require substantial effort to train the classifiers, yet achieve at best comparable performance.
To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exit framework for efficient inference.
This paper first demonstrates that the early exit problem can be effectively modeled as a distribution prediction problem, in which the distribution is approximated through the exit information of similar data.
Subsequently, it outlines the methodology for collecting exit information to construct the retrieval database.
Finally, leveraging the pre-constructed retrieval database, RAEE uses the exit information of the retrieved similar data to guide the backbone model in exiting at an appropriate layer.
Experimental results demonstrate that RAEE significantly accelerates inference while achieving robust zero-shot performance across eight downstream tasks.
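The retrieval-guided exit decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the database layout, the Euclidean nearest-neighbour search, the choice of k, and the use of the mode of the retrieved exit layers are all assumptions made for the sketch.

```python
import numpy as np

def build_database(embeddings, exit_layers):
    # Hypothetical offline step: store one embedding per training example
    # together with the exit layer recorded for that example.
    return np.asarray(embeddings, dtype=float), np.asarray(exit_layers)

def predict_exit_layer(query, db_emb, db_exits, k=8):
    # Retrieve the k nearest neighbours of the query embedding and use
    # their exit records to approximate the exit-layer distribution.
    dists = np.linalg.norm(db_emb - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    layers, counts = np.unique(db_exits[nearest], return_counts=True)
    # Exit at the most frequent layer among similar examples
    # (the mode of the approximated distribution).
    return int(layers[np.argmax(counts)])
```

At inference time, the backbone model would run layer by layer and stop once it reaches the layer predicted from the retrieved neighbours, skipping the remaining layers.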
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5968