TL;DR: We derive the layer-wise sparsity rates of LLMs from a theoretical perspective, significantly enhancing the performance of sparse LLMs.
Abstract: In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) from a theoretical perspective. Specifically, we identify a critical issue of **"reconstruction error explosion"** in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the problem of determining sparsity rates for all layers to tuning a single common-difference hyperparameter. Remarkably, this allows the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near-optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively. Code is available at https://github.com/wzhuang-xmu/ATP.
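As a rough illustration of the allocation scheme described above, the sketch below shows one way a monotonically increasing arithmetic-progression sparsity schedule could be parameterized around a target average sparsity. The function name, the centered-offset form, and the `beta` common-difference hyperparameter are illustrative assumptions, not the authors' released ATP implementation.

```python
import numpy as np

def arithmetic_progression_sparsity(num_layers: int, target_sparsity: float, beta: float):
    """Allocate layer-wise sparsity rates as a monotonically increasing
    arithmetic progression whose mean equals the target (global) sparsity.

    beta is the single common-difference hyperparameter: the rate of layer l is
        s_l = target_sparsity + beta * (l - (num_layers - 1) / 2),
    so earlier layers are pruned less and later layers are pruned more,
    while the average sparsity stays at target_sparsity.
    """
    offsets = np.arange(num_layers) - (num_layers - 1) / 2.0  # centered, mean 0
    rates = target_sparsity + beta * offsets
    assert rates.min() >= 0.0 and rates.max() <= 1.0, "choose a smaller beta"
    return rates

# Example: 32 decoder layers (e.g., LLaMA2-7B) at 70% average sparsity.
print(arithmetic_progression_sparsity(32, 0.70, 0.005))
```

Because the offsets are centered at zero, the mean of the per-layer rates equals the target sparsity for any `beta`, so tuning reduces to trying a few values of the common difference, as the abstract notes.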
Lay Summary: Large language models like ChatGPT need heavy computing power. Researchers often simplify these models by removing less important components ("sparsification"). But current methods face a hidden problem: errors caused by this simplification accumulate across the layers of the model, like a snowball rolling downhill. These growing errors eventually cripple the model's performance, a phenomenon we call "error explosion."
We discovered a smarter way to simplify these AI systems. Imagine organizing the model's layers like a musical crescendo: minimal simplification in early layers, gradually increasing in later ones. This approach prevents error accumulation and requires adjusting only one key parameter. Remarkably, finding the best pattern takes just a few attempts rather than exhaustive testing.
Our method makes simplified AI models significantly more accurate and efficient. When applied to a 7B model, it boosted task-solving accuracy by over 10% while making the model 70% leaner, and it more than doubled processing speed on both CPUs and GPUs. It also works for image and multimodal AI, enabling compact yet powerful models. For example, it could help run advanced AI assistants on everyday devices instead of energy-hungry servers. This approach balances AI capability with real-world usability, making advanced models faster, greener, and more accessible.
Link To Code: https://github.com/wzhuang-xmu/ATP
Primary Area: Deep Learning->Large Language Models
Keywords: Large language models, Network Sparsity, Layerwise sparsity
Submission Number: 58