Abstract: The inference phase of Large Language Models (LLMs) is computationally expensive. An ideal LLM inference stage would use fewer computational resources while maintaining capabilities such as generalization and in-context learning. In this paper, we try to answer the question, "During LLM inference, can we use shallow layers for easy instances and deep layers for hard ones?" To answer this question, we first show that not all layers are necessary during inference by statistically analyzing the activated layers across tasks. Then, we propose a simple algorithm named AdaInfer that adaptively determines when to terminate inference based on the input instance. Importantly, AdaInfer does not alter LLM parameters and maintains generalizability across tasks. Experiments on well-known LLMs (i.e., the Llama2 series and OPT) show that AdaInfer achieves an average pruning ratio of 17.8%, up to 43% on sentiment tasks, while maintaining comparable performance with minimal loss (<1%). Additionally, this method is orthogonal to other model acceleration techniques, potentially boosting inference efficiency further.
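To illustrate the general early-exit idea the abstract describes (stopping at shallow layers for easy instances and deeper layers for hard ones), below is a minimal conceptual sketch. It is not the paper's AdaInfer implementation: the function names, the per-layer interface, and the confidence-based stopping rule are all hypothetical placeholders standing in for whatever instance-level criterion AdaInfer actually uses.

```python
import torch


def adaptive_forward(hidden, layers, lm_head, threshold=0.9, min_layers=4):
    """Run decoder layers sequentially and exit early for "easy" inputs.

    Assumptions (hypothetical, for illustration only):
      - `hidden` is a [batch, seq_len, dim] tensor of input embeddings.
      - each element of `layers` maps hidden states to hidden states.
      - `lm_head` maps the last hidden state to next-token logits.
      - the stopping rule is a simple confidence threshold on the
        intermediate next-token distribution, not the paper's criterion.
    """
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i + 1 >= min_layers:
            # Probe the current layer's prediction for the last position.
            probs = torch.softmax(lm_head(hidden[:, -1, :]), dim=-1)
            if probs.max().item() >= threshold:
                # Easy instance: stop at a shallow layer.
                break
    # Logits come from the last executed layer, skipping the remaining ones.
    return lm_head(hidden[:, -1, :])
```

Because the loop simply breaks out of the layer stack, no model parameters are modified, which is consistent with the abstract's claim that the method leaves the LLM's weights untouched.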
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: conditional computation, large language models
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 5299