Keywords: constrained optimization, model pruning, fine-tuning, LLM
TL;DR: We propose a method that simultaneously fine-tunes and prunes LLMs using constrained optimization, achieving 1.88× faster inference at 50% sparsity with negligible performance loss compared to traditional two-stage approaches.
Abstract: Fine-tuning large language models (LLMs) for specific downstream tasks enables exceptional performance; unfortunately, their vast size hinders deployment in hardware-constrained environments. Hence, small, domain-specific models are created for such scenarios, usually in a two-stage process of first pruning an LLM and then fine-tuning (FT) it.
However, performing these two steps jointly may yield better results, as FT and pruning can then adapt to each other.
Motivated by this potential, we propose a method based on constrained optimization that uses augmented Lagrangian methods to simultaneously fine-tune and prune (SFP) LLMs to a target sparsity.
Our approach is directly compatible with parameter-efficient fine-tuning (PEFT) techniques and can be applied to structures of different granularities.
We evaluate our method against state-of-the-art pruning techniques and show similar or better performance. Specifically, SFP can prune a 7-billion-parameter model to 50% sparsity and achieve 1.88× faster inference with negligible performance degradation.
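For intuition, here is a minimal sketch of how an augmented Lagrangian can steer training toward a target sparsity, as the abstract describes. This is toy PyTorch code under stated assumptions: the soft per-weight mask, the penalty weight, and the dual update schedule are illustrative choices, not the paper's actual implementation.

```python
# Sketch: augmented Lagrangian training toward a target sparsity.
# Assumptions (not from the paper): a sigmoid soft mask over weights,
# a fixed penalty weight rho, and plain dual ascent on the multiplier.
import torch

torch.manual_seed(0)

# Toy "model": one linear layer with a learnable soft mask per weight.
weight = torch.randn(64, 64, requires_grad=True)
scores = torch.zeros(64, 64, requires_grad=True)  # mask logits; sigmoid -> keep prob.
lam, rho = 0.0, 1.0                               # Lagrange multiplier, penalty weight
target_sparsity = 0.5
opt = torch.optim.Adam([weight, scores], lr=1e-2)

x = torch.randn(256, 64)
y = torch.randn(256, 64)

for step in range(500):
    mask = torch.sigmoid(scores)                  # soft, differentiable mask
    pred = x @ (weight * mask).T
    task_loss = torch.nn.functional.mse_loss(pred, y)

    # Constraint g(theta) = 0: achieved sparsity should equal the target.
    sparsity = 1.0 - mask.mean()
    g = target_sparsity - sparsity

    # Augmented Lagrangian: task loss + multiplier term + quadratic penalty.
    loss = task_loss + lam * g + 0.5 * rho * g ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Dual ascent on the multiplier; a schedule could also grow rho over time.
    with torch.no_grad():
        lam += rho * g.item()
```

Because the mask is learned jointly with the weights, the optimizer trades off task loss against the sparsity constraint at every step, which is the "adapt to each other" behavior the two-stage prune-then-FT pipeline cannot exploit; a hard threshold on the mask at the end would yield the final pruned model.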
Submission Number: 57