Keywords: structured pruning, depth compression, layer pruning, layer merging, Transformer, LLM
TL;DR: We propose FlattenGPT, a novel depth compression method for LLMs that employs layer flattening to bridge the gap between layer pruning and channel pruning.
Abstract: This work proposes FlattenGPT, a novel fine-grained depth compression method for transformers. Recent works have observed redundancy across transformer blocks, prompting research on depth compression that prunes less crucial blocks. However, pruning entire blocks risks discarding the knowledge learned in those blocks, leading to serious performance degradation. Channel pruning, on the other hand, better preserves performance, but it cannot compress model depth and is complicated by inconsistent pruning ratios across layers. To address these issues, our method introduces a novel operation named layer flattening, which bridges the gap between layer pruning and channel pruning. By converting two adjacent blocks into one, it compresses the network depth and enables fine-grained parameter removal. FlattenGPT strives to preserve the knowledge learned in all blocks while remaining consistent with the original architecture, improving model efficiency with a favorable trade-off against performance. Extensive experiments demonstrate that FlattenGPT outperforms existing pruning methods in both zero-shot accuracy and WikiText-2 perplexity across various model types and parameter sizes. It also surpasses other pruning methods in accelerating LLM inference, making it a promising approach for improving the efficiency of transformers.
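To make the idea of merging two adjacent blocks into one more concrete, below is a minimal illustrative sketch in PyTorch. It is an assumption-based approximation, not the actual FlattenGPT operation (the abstract does not specify the merge mechanism): two sequential residual MLP blocks are flattened into a single wider block by concatenating their hidden channels, so the merged block computes the two branches in parallel. The names `MLPBlock` and `flatten_blocks` are hypothetical.

```python
# Sketch of "layer flattening" for two adjacent MLP blocks (assumption only).
# Two sequential residual branches are approximated by one wider parallel
# branch x + f1(x) + f2(x), which halves depth while keeping every learned
# channel available for later fine-grained (channel-level) pruning.
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """A plain pre-norm residual MLP block: x + W_down(act(W_up(LN(x))))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.down(self.act(self.up(self.norm(x))))


@torch.no_grad()
def flatten_blocks(b1: MLPBlock, b2: MLPBlock) -> MLPBlock:
    """Merge two adjacent blocks into one wider block (hypothetical sketch).

    The merged block concatenates the hidden channels of b1 and b2, so its
    output is x + f1(x) + f2(x): a parallel approximation of the original
    sequential composition. All channels of both blocks are preserved.
    """
    d_model = b1.up.in_features
    h1, h2 = b1.up.out_features, b2.up.out_features
    merged = MLPBlock(d_model, h1 + h2)
    # Reuse b1's normalization; treating b2's norm as identical is part of
    # the approximation made by this sketch.
    merged.norm.load_state_dict(b1.norm.state_dict())
    merged.up.weight.copy_(torch.cat([b1.up.weight, b2.up.weight], dim=0))
    merged.up.bias.copy_(torch.cat([b1.up.bias, b2.up.bias], dim=0))
    merged.down.weight.copy_(torch.cat([b1.down.weight, b2.down.weight], dim=1))
    merged.down.bias.copy_(b1.down.bias + b2.down.bias)
    return merged


# Usage: two depth-2 blocks become a single block of doubled hidden width.
x = torch.randn(4, 16, 64)
b1, b2 = MLPBlock(64, 256), MLPBlock(64, 256)
flat = flatten_blocks(b1, b2)
print(flat(x).shape)  # torch.Size([4, 16, 64])
```

Because the flattened block keeps all hidden channels of both original blocks, standard channel pruning can then remove the least important ones at a uniform ratio, which is the gap-bridging behavior the abstract describes.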
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11797