Reassessing Layer Pruning in LLMs: New Insights and Methods

Published: 26 Jan 2026 · Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large Language Model, Layer Pruning, Model Compression
TL;DR: This paper presents a theoretical and empirical analysis of layer pruning in Large Language Models, aiming to improve and refine pruning strategies.
Abstract: Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading approach for fine-tuning pruned models, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs, gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final layers followed by fine-tuning the lm\_head and the remaining last three layers, yields remarkably strong performance. These pruning strategies are further supported by theoretical analyses based on gradient flow. Following this guide, our method surpasses existing state-of-the-art pruning methods by $5.62\%$–$17.27\%$ on Llama-3.1-8B-It, by $2.36\%$–$19.45\%$ on Llama-3-8B, and by $4.34\%$–$9.59\%$ on Llama-3-70B. The code is available at https://github.com/yaolu-zjut/Navigation_LLM_layer_pruning.
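The strategy described in the abstract (drop the final transformer layers, then fine-tune only the lm_head and the last three remaining layers) can be illustrated with the following minimal sketch. This is not the authors' released code; the model name, the pruning budget of 8 layers, and the optimizer settings are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): prune the final decoder layers of a
# Llama-style model, then mark only lm_head and the last three remaining layers
# as trainable for post-pruning fine-tuning.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumed model; any Llama-style checkpoint works
    torch_dtype=torch.bfloat16,
)

# Remove the last `num_to_prune` decoder layers (illustrative budget).
num_to_prune = 8
layers = model.model.layers
model.model.layers = torch.nn.ModuleList(layers[: len(layers) - num_to_prune])
model.config.num_hidden_layers = len(model.model.layers)

# Freeze all parameters, then unfreeze lm_head and the last three kept layers.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True
for layer in model.model.layers[-3:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # assumed learning rate
# ...continue with a standard causal-LM fine-tuning loop over the pruned model.
```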
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2804