Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models

Published: 18 Jun 2024, Last Modified: 03 Jul 2024 · TF2M 2024 Poster · CC BY 4.0
Keywords: model pruning, large language models
TL;DR: We find that dropping deeper attention layers only marginally decreases performance
Abstract: The inference demand for LLMs has skyrocketed in recent months, and serving models with low latency remains challenging due to their size and their quadratic complexity in input length. In this work, we investigate the effect of dropping various layers at inference time on the performance of Llama2 models. We find that dropping deeper attention layers, which we call \emph{inference-time attention removal} (ITAR), only marginally decreases performance. For example, removing 33\% of attention layers in a 13B Llama2 model results in a 0.9\% drop in average performance over the OpenLLM benchmark (ARC, HellaSwag, TruthfulQA). Removing attention sublayers leads to a smaller drop in performance and larger runtime improvements than removing the feed-forward sublayers.
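To illustrate the idea behind ITAR, here is a minimal PyTorch sketch (not the authors' code, and not the Llama2 architecture): a toy pre-norm transformer block whose attention sublayer can be skipped at inference time, while the residual stream and the feed-forward sublayer are left untouched. All module names, dimensions, and the 12-block stack are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy pre-norm transformer block with an optional ITAR-style attention skip."""
    def __init__(self, d_model: int, n_heads: int, skip_attention: bool = False):
        super().__init__()
        self.skip_attention = skip_attention  # if True, drop the attention sublayer
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.skip_attention:
            # Attention sublayer: normalize, self-attend, add back residually.
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
        # Feed-forward sublayer always runs.
        x = x + self.mlp(self.mlp_norm(x))
        return x

# Hypothetical 12-block stack: remove attention in the deepest third of blocks.
n_layers, drop_fraction = 12, 1 / 3
blocks = nn.ModuleList(
    [Block(d_model=256, n_heads=8,
           skip_attention=(i >= n_layers * (1 - drop_fraction)))
     for i in range(n_layers)]
)

x = torch.randn(1, 16, 256)  # (batch, seq_len, d_model)
with torch.no_grad():
    for block in blocks:
        x = block(x)
print(x.shape)
```

In this sketch, skipping attention in the deepest 4 of 12 blocks corresponds to the 33\% removal rate quoted in the abstract; because the sublayer is applied residually, dropping it simply passes the hidden states through unchanged at that point in the block.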
Submission Number: 69