Keywords: Model Compression, Neuron Summary, Weight Sharing, Large Language Models
TL;DR: We propose Neuron Summary (NS), a novel learnable weight-sharing method that compresses LLMs by creating a compact representation of the weights in their linear layers.
Abstract: The rapid growth in the size of Large Language Models (LLMs) poses significant challenges for deployment, particularly in resource-limited environments. To address this issue, we propose Neuron Summary (NS), a novel approach for compressing LLMs by constructing compact representations of the weights in their linear layers. Because these layers account for most of the overall parameter count, NS offers an effective way to reduce model size and computational cost while maintaining strong performance on downstream natural language processing tasks. In our compressed model, NSNet, each linear layer of the LLM is replaced with an NS-Linear layer whose weights are represented using NS. The transition from a pre-trained LLM to NSNet is achieved through regression-based initialization, followed by knowledge distillation to preserve the original model's capabilities.
Extensive experiments on compressing various LLMs, including DeBERTaV3-base and Llama-2, demonstrate that NS significantly outperforms existing compression methods across multiple tasks, such as natural language understanding, question answering, and text generation. Additionally, NS is complementary to other compression techniques, such as quantization and layer-wise parameter sharing, enabling further reduction in model size while maintaining competitive performance. The code for NSNet is available at \url{https://anonymous.4open.science/r/NSNet-D6B8/}.
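The abstract does not spell out the exact NS parameterization, so the following is only a minimal sketch of the idea as described above: a hypothetical `NSLinear` module whose weight matrix is reconstructed from a small shared "summary" bank via learned per-neuron coefficients, plus a least-squares routine standing in for the regression-based initialization. The class name, shapes, and the `num_summaries` parameter are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class NSLinear(nn.Module):
    """Hypothetical NS-Linear layer (sketch): the dense weight matrix is
    reconstructed on the fly from a small bank of shared "summary" vectors
    and learned per-output-neuron mixing coefficients."""

    def __init__(self, in_features: int, out_features: int, num_summaries: int = 64):
        super().__init__()
        # Shared summary bank; with num_summaries << out_features the parameter
        # count drops from out_features * in_features to roughly
        # num_summaries * in_features + out_features * num_summaries.
        self.summaries = nn.Parameter(0.02 * torch.randn(num_summaries, in_features))
        # Per-neuron coefficients that mix the summaries into each weight row.
        self.coeffs = nn.Parameter(0.02 * torch.randn(out_features, num_summaries))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def reconstructed_weight(self) -> torch.Tensor:
        # (out_features, num_summaries) @ (num_summaries, in_features)
        return self.coeffs @ self.summaries

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.reconstructed_weight(), self.bias)


@torch.no_grad()
def regression_init(ns_layer: NSLinear, dense_weight: torch.Tensor) -> None:
    """Stand-in for the regression-based initialization described in the
    abstract: fit the mixing coefficients by least squares so that
    coeffs @ summaries approximates the pre-trained dense weight."""
    # Solve summaries.T @ X ≈ dense_weight.T for X, then coeffs = X.T.
    solution = torch.linalg.lstsq(ns_layer.summaries.T, dense_weight.T).solution
    ns_layer.coeffs.copy_(solution.T)
```

Under these assumptions, a full conversion would swap each `nn.Linear` in the pre-trained model for an `NSLinear` initialized this way, then train the resulting NSNet with knowledge distillation against the original model, as the abstract describes.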
Submission Number: 2