Keywords: constrained optimization, model pruning, fine-tuning, LLM
TL;DR: We propose a method that simultaneously fine-tunes and prunes LLMs using constrained optimization, achieving 1.88× faster inference at 50% sparsity with negligible performance loss compared to traditional two-stage approaches.
Abstract: Fine-tuning large language models (LLMs) for specific downstream tasks enables exceptional performance; unfortunately, their vast size hinders deployment in hardware-constrained environments. Hence, small, domain-specific models are created for such scenarios, usually in a two-stage process of first pruning an LLM and then fine-tuning (FT) it.
However, performing these two steps jointly may yield better results, as FT and pruning can then adapt to each other.
Motivated by this potential, we propose a method based on constrained optimization that uses augmented Lagrangian methods to simultaneously fine-tune and prune (SFP) LLMs to a target sparsity.
Our approach is directly compatible with parameter-efficient fine-tuning (PEFT) techniques and can be applied to structures of different granularities.
We evaluate our method against state-of-the-art pruning techniques and show similar or better performance. Specifically, SFP can prune a 7-billion-parameter model to 50% sparsity and achieve 1.88× faster inference with negligible performance degradation.
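For intuition, here is a minimal sketch of how an augmented Lagrangian can steer training toward a target sparsity, as the abstract describes. This is toy PyTorch code under stated assumptions: the soft per-weight mask, the penalty weight, and the dual update schedule are illustrative choices, not the paper's actual implementation.

```python
# Sketch: augmented Lagrangian training toward a target sparsity.
# Assumptions (not from the paper): a sigmoid soft mask over weights,
# a fixed penalty weight rho, and plain dual ascent on the multiplier.
import torch

torch.manual_seed(0)

# Toy "model": one linear layer with a learnable soft mask per weight.
weight = torch.randn(64, 64, requires_grad=True)
scores = torch.zeros(64, 64, requires_grad=True)  # mask logits; sigmoid -> keep prob.
lam, rho = 0.0, 1.0                               # Lagrange multiplier, penalty weight
target_sparsity = 0.5
opt = torch.optim.Adam([weight, scores], lr=1e-2)

x = torch.randn(256, 64)
y = torch.randn(256, 64)

for step in range(500):
    mask = torch.sigmoid(scores)                  # soft, differentiable mask
    pred = x @ (weight * mask).T
    task_loss = torch.nn.functional.mse_loss(pred, y)

    # Constraint g(theta) = 0: achieved sparsity should equal the target.
    sparsity = 1.0 - mask.mean()
    g = target_sparsity - sparsity

    # Augmented Lagrangian: task loss + multiplier term + quadratic penalty.
    loss = task_loss + lam * g + 0.5 * rho * g ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Dual ascent on the multiplier; a schedule could also grow rho over time.
    with torch.no_grad():
        lam += rho * g.item()
```

Because the mask is learned jointly with the weights, the optimizer trades off task loss against the sparsity constraint at every step, which is the "adapt to each other" behavior the two-stage prune-then-FT pipeline cannot exploit; a hard threshold on the mask at the end would yield the final pruned model.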
Submission Number: 57