Abstract: Autoregressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language processing. However, because each token is processed sequentially through a stack of transformer layers, autoregressive decoding faces significant computation and latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency by skipping layers come in two distinct flavors: 1) early exit, and 2) input-agnostic heuristics, where tokens exit at pre-determined layers irrespective of the input sequence. Both strategies have limitations: the former is incompatible with the KV caching that modern frameworks rely on for speed-ups, and the latter does not capture the variation in layer importance across tasks or, more generally, across input sequences.
To address both limitations, we propose \textsc{FiRST}, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence: the prompt (during the prefill stage) decides which layers will be skipped during decoding. \textsc{FiRST} preserves compatibility with KV caching, enabling faster inference while remaining quality-aware. \textsc{FiRST} is model-agnostic and can be easily enabled on any pre-trained LLM.
Our approach reveals that input adaptivity is critical: which middle layers play a crucial role in evolving the hidden representations differs from task to task. Extensive experiments show that \textsc{FiRST} significantly reduces latency while retaining performance competitive with baselines, making our approach an efficient solution for LLM deployment in low-resource environments.
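The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean-pooled router input, the sigmoid score, the 0.5 threshold, and the names `select_layers` and `decode_step` are all assumptions made for illustration, and the "layers" are plain functions standing in for transformer blocks.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]  # stand-in for a transformer block


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mean_pool(states: Sequence[Vector]) -> Vector:
    """Average the per-token hidden states of the prompt."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def select_layers(
    prompt_states: Sequence[Vector],
    layers: Sequence[Layer],
    router_weights: Sequence[Vector],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[bool]:
    """Prefill stage: each layer's router scores the prompt and decides keep/skip.

    Assumption: a router is a linear score over the mean-pooled prompt
    states followed by a sigmoid. Skipped layers are also skipped here.
    """
    keep: List[bool] = []
    states = [list(s) for s in prompt_states]
    for layer, weights in zip(layers, router_weights):
        pooled = mean_pool(states)
        score = sigmoid(sum(w * x for w, x in zip(weights, pooled)))
        use_layer = score >= threshold
        keep.append(use_layer)
        if use_layer:
            states = [layer(s) for s in states]
    return keep


def decode_step(token_state: Vector, layers: Sequence[Layer], keep: Sequence[bool]) -> Vector:
    """Decoding stage: reuse the mask fixed at prefill for every new token."""
    state = list(token_state)
    for layer, use_layer in zip(layers, keep):
        if use_layer:
            state = layer(state)
    return state
```

Because the keep/skip mask in this sketch is fixed once per sequence, every decoded token passes through the same layers, so a kept layer would always find cache entries for all earlier tokens. That is one plausible reading of why a per-sequence decision stays compatible with KV caching, whereas per-token early exit does not.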
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Layer Skipping, Latency Reduction, KV Caching, Input Adaptivity
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English, German, Chinese
Submission Number: 2299