TL;DR: We propose an evolutionary optimization procedure for heterogeneous compression of large language models via pruning, quantization, or layer dropping.
Abstract: The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold.
Yet, current methods rely on estimating the "importance" of a given layer, implicitly assuming that layers contribute independently to the overall compression error.
We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: counterintuitively, pruning a model further can in some cases even significantly recover performance.
To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
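To make the optimization view concrete, one illustrative way to write the per-layer assignment problem described above is the following (a sketch under our own notation; the symbols $c_\ell$, $s(\cdot)$, and $S_{\text{target}}$ are assumptions, not taken from the paper):

$$
\min_{c_1, \dots, c_L} \; \mathcal{L}\!\left(M(c_1, \dots, c_L)\right)
\quad \text{subject to} \quad
\frac{1}{L} \sum_{\ell=1}^{L} s(c_\ell) \;\ge\; S_{\text{target}},
$$

where $c_\ell$ is the compression level (e.g., sparsity or bitwidth) assigned to layer $\ell$, $\mathcal{L}$ is a calibration loss evaluated on the compressed model $M(c_1, \dots, c_L)$, $s(\cdot)$ maps a level to its compression ratio, and $S_{\text{target}}$ is the global compression threshold.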
Lay Summary: Machine learning models that generate text have become extremely large and expensive to run. This has led researchers to come up with ways to shrink these models. However, most of these methods compress all parts of the model by the same amount, even though some parts can be shrunk more than others.
To fix this, we developed EvoPress, an evolutionary approach that figures out exactly how much to shrink each part of the model. EvoPress works in steps: it starts with a compressed model, then in each round it tests slightly modified versions (mutations), and picks the one that works best (selection). EvoPress can be used on top of any compression technique and on different text-generation models.
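As a rough illustration of this mutate-and-select loop (not the authors' implementation; the bitwidth levels, budget, and toy fitness function below are invented for the example), a minimal (1+λ) evolutionary search over per-layer bitwidth profiles might look like:

```python
import random

# Toy setting (all numbers are illustrative assumptions, not from the paper):
# each layer gets a bitwidth from LEVELS, and the average bitwidth must not
# exceed TARGET_AVG_BITS.
LEVELS = (2, 3, 4)
NUM_LAYERS = 32
TARGET_AVG_BITS = 3.0

def fitness(profile):
    """Stand-in for real model quality (e.g. negative calibration loss):
    rewards spending bits on layers with higher (made-up) importance."""
    importance = [1.0 + 0.1 * i for i in range(NUM_LAYERS)]
    return -sum(imp * (max(LEVELS) - bits) for imp, bits in zip(importance, profile))

def mutate(profile):
    """Move one bit from one layer to another, keeping the average bitwidth
    (and hence the global compression budget) unchanged."""
    child = list(profile)
    i, j = random.sample(range(NUM_LAYERS), 2)
    if child[i] > min(LEVELS) and child[j] < max(LEVELS):
        child[i] -= 1
        child[j] += 1
    return child

# (1 + lambda) evolutionary search: start from a uniform, feasible profile,
# generate a few mutated offspring per round, and keep the fittest candidate.
parent = [3] * NUM_LAYERS          # average bitwidth = 3.0, meets the budget
for _ in range(500):
    offspring = [mutate(parent) for _ in range(8)]
    parent = max(offspring + [parent], key=fitness)

print("average bitwidth:", sum(parent) / NUM_LAYERS)
print("bits per layer:", parent)
```

In this sketch, the real quality measurement (running the compressed model on calibration data) is replaced by a toy importance-weighted score, but the structure of the loop, mutations that preserve the global budget followed by selection of the best candidate, mirrors the procedure described above.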
Through this procedure, EvoPress sets a new benchmark for finding high-quality models that fit within a size limit. This means that large language models can run more efficiently, which makes AI tools more accessible to everyone.
Link To Code: https://github.com/IST-DASLab/EvoPress
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, compression, quantization, pruning, evolutionary algorithms
Submission Number: 10304