Keywords: large language models, compression, evolutionary algorithms, quantization, pruning
TL;DR: We propose a provably optimal approach for heterogeneous (non-uniform) compression of large language models via pruning, quantization, or layer dropping.
Abstract: The high computational costs of large language models (LLMs) have led to
a flurry of research on LLM compression, via methods such as quantization,
sparsification, or structured pruning. A new frontier in this area is given by
dynamic, non-uniform compression methods, which adjust the compression
levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy
loss, while guaranteeing a global compression threshold. Yet, current methods
rely on heuristics for identifying the “importance” of a given layer with respect to the
loss, based on assumptions such as error monotonicity, i.e., that the end-to-end
model compression error is proportional to the sum of layer-wise errors. In this
paper, we revisit this area, and propose a new and general approach for dynamic
compression that is provably optimal in a given input range. We begin from
the motivating observation that, in general, error monotonicity does not hold for
LLMs: compressed models with a lower sum of per-layer errors can perform worse
than models with higher error sums. To address this, we propose a new general
evolutionary framework for dynamic LLM compression called EvoPress, which
has provable convergence and low sample and evaluation complexity. We show that
these theoretical guarantees lead to highly competitive practical performance
for dynamic compression of Llama, Mistral, and Phi models: via EvoPress, we
set new state-of-the-art results for structured pruning (block/layer dropping),
unstructured sparsity, and quantization with dynamic bitwidths.
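
As a rough illustration of the dynamic-compression setup described in the abstract (and not the paper's EvoPress algorithm), the sketch below runs a toy (1+1)-style evolutionary search that reassigns per-layer sparsity levels under a fixed average-sparsity budget. The sparsity levels, layer count, sensitivity profile, and fitness function are all invented for illustration; in practice the fitness would be measured by evaluating the compressed model on calibration data.

```python
# Hypothetical sketch: evolutionary search over per-layer sparsity levels
# subject to a global average-sparsity constraint. Not the authors' EvoPress
# implementation; all constants and the fitness function are assumptions.
import random
from typing import List

LEVELS = [0.0, 0.25, 0.5, 0.75]   # candidate per-layer sparsity levels (assumed)
NUM_LAYERS = 32                   # e.g., a decoder stack of 32 layers (assumed)
TARGET = 0.5                      # required average sparsity across layers


def evaluate_loss(assignment: List[float]) -> float:
    """Stand-in fitness: penalize sparsity placed on 'sensitive' layers.
    A real system would run the compressed model on a calibration set instead."""
    sensitivity = [1.0 + 0.1 * i for i in range(NUM_LAYERS)]  # toy sensitivity profile
    return sum(s * level for s, level in zip(sensitivity, assignment))


def feasible(assignment: List[float]) -> bool:
    """Check the global compression constraint (average sparsity >= TARGET)."""
    return sum(assignment) / NUM_LAYERS >= TARGET


def mutate(assignment: List[float]) -> List[float]:
    """Swap the levels of two random layers; this preserves the global budget."""
    child = assignment[:]
    i, j = random.sample(range(NUM_LAYERS), 2)
    child[i], child[j] = child[j], child[i]
    return child


def search(generations: int = 200) -> List[float]:
    # Start from a feasible assignment that alternates low/high sparsity,
    # then hill-climb with random level swaps as the mutation operator.
    parent = [TARGET] * NUM_LAYERS
    for i in range(0, NUM_LAYERS, 2):
        parent[i], parent[i + 1] = LEVELS[1], LEVELS[3]
    best_loss = evaluate_loss(parent)
    for _ in range(generations):
        child = mutate(parent)
        child_loss = evaluate_loss(child)
        if feasible(child) and child_loss < best_loss:
            parent, best_loss = child, child_loss
    return parent


if __name__ == "__main__":
    best = search()
    print("per-layer sparsity:", best)
    print("average sparsity:", sum(best) / NUM_LAYERS)
```

In this toy setup, the search tends to push high sparsity onto the low-sensitivity layers while keeping the average sparsity at the target, mirroring the idea of non-uniform, per-layer compression under a global budget.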
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13864