Abstract: The inference phase of Large Language Models (LLMs) is computationally expensive. An ideal LLM inference stage would use fewer computational resources while maintaining capabilities such as generalization and in-context learning. In this paper, we try to answer the question, "During LLM inference, can we use shallow layers for easy instances and deep layers for hard ones?" To answer this question, we first show that not all layers are necessary during inference by statistically analyzing the activated layers across tasks. Then, we propose a simple algorithm named AdaInfer that adaptively determines when to terminate inference based on the input instance. Importantly, AdaInfer does not alter LLM parameters and maintains generalizability across tasks. Experiments on well-known LLMs (i.e., the Llama2 series and OPT) show that AdaInfer achieves an average pruning ratio of 17.8%, up to 43% on sentiment tasks, while maintaining comparable performance with minimal loss (<1%). Additionally, this method is orthogonal to other model acceleration techniques, potentially boosting inference efficiency further.
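To illustrate the general early-exit idea the abstract describes (stopping at shallow layers for easy instances and deeper layers for hard ones), below is a minimal conceptual sketch. It is not the paper's AdaInfer implementation: the function names, the per-layer interface, and the confidence-based stopping rule are all hypothetical placeholders standing in for whatever instance-level criterion AdaInfer actually uses.

```python
import torch


def adaptive_forward(hidden, layers, lm_head, threshold=0.9, min_layers=4):
    """Run decoder layers sequentially and exit early for "easy" inputs.

    Assumptions (hypothetical, for illustration only):
      - `hidden` is a [batch, seq_len, dim] tensor of input embeddings.
      - each element of `layers` maps hidden states to hidden states.
      - `lm_head` maps the last hidden state to next-token logits.
      - the stopping rule is a simple confidence threshold on the
        intermediate next-token distribution, not the paper's criterion.
    """
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i + 1 >= min_layers:
            # Probe the current layer's prediction for the last position.
            probs = torch.softmax(lm_head(hidden[:, -1, :]), dim=-1)
            if probs.max().item() >= threshold:
                # Easy instance: stop at a shallow layer.
                break
    # Logits come from the last executed layer, skipping the remaining ones.
    return lm_head(hidden[:, -1, :])
```

Because the loop simply breaks out of the layer stack, no model parameters are modified, which is consistent with the abstract's claim that the method leaves the LLM's weights untouched.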
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: conditional computation, large language models
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 5299