Keywords: large language models, compression, evolutionary algorithms, quantization, pruning
TL;DR: We propose a provably optimal approach for heterogeneous (non-uniform) compression of large language models via pruning, quantization, or layer dropping.
Abstract: The high computational costs of large language models (LLMs) have led to
a flurry of research on LLM compression, via methods such as quantization,
sparsification, or structured pruning. A new frontier in this area is given by
dynamic, non-uniform compression methods, which adjust the compression
levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy
loss, while guaranteeing a global compression threshold. Yet, current methods
rely on heuristics for identifying the “importance” of a given layer with respect to the
loss, based on assumptions such as error monotonicity, i.e., that the end-to-end
model compression error is proportional to the sum of layer-wise errors. In this
paper, we revisit this area, and propose a new and general approach for dynamic
compression that is provably optimal in a given input range. We begin from
the motivating observation that, in general, error monotonicity does not hold for
LLMs: compressed models with a lower sum of per-layer errors can perform worse
than models with higher error sums. To address this, we propose a new general
evolutionary framework for dynamic LLM compression called EvoPress, which
has provable convergence and low sample and evaluation complexity. We show that
these theoretical guarantees lead to highly competitive practical performance
for dynamic compression of Llama, Mistral, and Phi models: via EvoPress, we
set new state-of-the-art results for structured pruning (block/layer dropping),
unstructured sparsity, and quantization with dynamic bitwidths.
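
As a rough illustration of the dynamic-compression setup described in the abstract (and not the paper's EvoPress algorithm), the sketch below runs a toy (1+1)-style evolutionary search that reassigns per-layer sparsity levels under a fixed average-sparsity budget. The sparsity levels, layer count, sensitivity profile, and fitness function are all invented for illustration; in practice the fitness would be measured by evaluating the compressed model on calibration data.

```python
# Hypothetical sketch: evolutionary search over per-layer sparsity levels
# subject to a global average-sparsity constraint. Not the authors' EvoPress
# implementation; all constants and the fitness function are assumptions.
import random
from typing import List

LEVELS = [0.0, 0.25, 0.5, 0.75]   # candidate per-layer sparsity levels (assumed)
NUM_LAYERS = 32                   # e.g., a decoder stack of 32 layers (assumed)
TARGET = 0.5                      # required average sparsity across layers


def evaluate_loss(assignment: List[float]) -> float:
    """Stand-in fitness: penalize sparsity placed on 'sensitive' layers.
    A real system would run the compressed model on a calibration set instead."""
    sensitivity = [1.0 + 0.1 * i for i in range(NUM_LAYERS)]  # toy sensitivity profile
    return sum(s * level for s, level in zip(sensitivity, assignment))


def feasible(assignment: List[float]) -> bool:
    """Check the global compression constraint (average sparsity >= TARGET)."""
    return sum(assignment) / NUM_LAYERS >= TARGET


def mutate(assignment: List[float]) -> List[float]:
    """Swap the levels of two random layers; this preserves the global budget."""
    child = assignment[:]
    i, j = random.sample(range(NUM_LAYERS), 2)
    child[i], child[j] = child[j], child[i]
    return child


def search(generations: int = 200) -> List[float]:
    # Start from a feasible assignment that alternates low/high sparsity,
    # then hill-climb with random level swaps as the mutation operator.
    parent = [TARGET] * NUM_LAYERS
    for i in range(0, NUM_LAYERS, 2):
        parent[i], parent[i + 1] = LEVELS[1], LEVELS[3]
    best_loss = evaluate_loss(parent)
    for _ in range(generations):
        child = mutate(parent)
        child_loss = evaluate_loss(child)
        if feasible(child) and child_loss < best_loss:
            parent, best_loss = child, child_loss
    return parent


if __name__ == "__main__":
    best = search()
    print("per-layer sparsity:", best)
    print("average sparsity:", sum(best) / NUM_LAYERS)
```

In this toy setup, the search tends to push high sparsity onto the low-sensitivity layers while keeping the average sparsity at the target, mirroring the idea of non-uniform, per-layer compression under a global budget.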
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13864