AdaInfer: Instance-aware Adaptive Inference for LLMs

Anonymous

16 Feb 2024 (modified: 18 Jun 2024) · ACL ARR 2024 February Blind Submission
Abstract: The inference phase of Large Language Models (LLMs) is very expensive. An ideal inference paradigm would use fewer computational resources while preserving the model's generalization and in-context learning abilities. In this paper, we try to answer the question: "During LLM inference, can we use shallow layers for easy inputs and deep layers for hard ones?" To answer this question, we first show that not all layers are necessary at inference time by statistically analyzing the activated layers across tasks. We then propose AdaInfer, a simple algorithm for instance-aware adaptive inference that determines when to terminate inference based on the input instance itself. More importantly, AdaInfer does not alter LLM parameters and maintains generalizability across tasks. Experiments on well-known LLMs (i.e., Llama2-7B/13B and OPT-13B) show that AdaInfer saves 10% to 50% of computational resources on mainstream tasks (e.g., knowledge-based/common-sense QA, text classification) while maintaining accuracy with minimal average loss (<1%). Additionally, this method is orthogonal to other model acceleration techniques (e.g., sparse and flash attention), offering the potential to further enhance inference efficiency. Code and data are available at Anonymous GitHub.
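The idea of terminating inference early for easy inputs can be illustrated with a minimal, self-contained sketch. Everything here is an assumption for illustration: the "layers" are mock functions, the classifier head and the confidence criterion (a threshold on the top softmax probability of the intermediate state) are stand-ins, and the paper's actual stopping signal for AdaInfer may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_infer(layers, classifier_head, x, threshold=0.9):
    """Run layers sequentially and exit as soon as the early-exit
    confidence (top softmax probability of the current hidden state,
    projected by a shared classifier head) crosses `threshold`.
    Returns (predicted_index, layers_used). This is a toy sketch,
    not the actual AdaInfer criterion."""
    hidden = x
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(classifier_head(hidden))
        top = max(probs)
        if top >= threshold or depth == len(layers):
            return probs.index(top), depth

# Toy model: each "layer" sharpens the logits; the head is identity.
double = lambda h: [2.0 * v for v in h]
layers = [double] * 6
head = lambda h: h

# A "hard" input (logits close together) needs more layers than an
# "easy" one (logits already well separated).
pred_hard, used_hard = adaptive_infer(layers, head, [0.1, 1.0])
pred_easy, used_easy = adaptive_infer(layers, head, [0.1, 2.0])
print(used_hard, used_easy)  # the easy input exits earlier
```

In a real LLM, `layers` would be the decoder blocks and `classifier_head` a projection of each block's hidden state; the point of the sketch is only that a per-instance confidence signal lets shallow layers serve easy inputs while hard inputs traverse more of the network.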
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings: efficiency
Languages Studied: English
