Keywords: Pruning, LLMs
Abstract: Structured pruning of Generative Pre-trained Transformers (GPTs) offers a promising path to efficient models, but discarding transformer blocks often degrades performance.
In this paper, we introduce FuseGPT, a compression paradigm that reframes structured pruning as knowledge redistribution rather than simple removal.
Instead of discarding less salient blocks, FuseGPT recycles them by fusing their knowledge into neighboring blocks, thereby preserving the model's performance.
Our approach has two core components.
First, we propose a fusion-aware importance metric, Macro Influence (MI), that identifies blocks not by their redundancy, but by their capacity to be effectively absorbed by other blocks.
Second, we introduce a learnable layer-fusion mechanism that uses low-rank matrices to graft the knowledge from a pruned block onto its neighbors.
This process is guided by a lightweight, group-level fine-tuning procedure that uses a distillation-based loss to ensure the fused knowledge is properly integrated.
FuseGPT works for both large language and multimodal models, generally surpassing representative prior methods in perplexity and zero-shot task performance while using as few as 32 calibration samples and 1024 fine-tuning samples.
This ``prune-and-fuse'' approach opens a new avenue for model compression, focusing on repurposing rather than discarding valuable pre-trained knowledge.
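The fusion mechanism described above can be illustrated with a minimal numerical sketch. Here two adjacent blocks are reduced to linear maps (an assumption for illustration; the actual method operates on full transformer blocks), the second is pruned, and a learnable low-rank correction `A @ B` is grafted onto the survivor and fitted by distilling the original group's output with an MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 16, 4, 256  # hidden dim, low-rank width, calibration samples

# Toy stand-ins for two adjacent transformer blocks as linear maps.
W_keep = rng.normal(size=(d, d)) / np.sqrt(d)   # surviving neighbor block
W_prune = rng.normal(size=(d, d)) / np.sqrt(d)  # block selected for removal

X = rng.normal(size=(n, d))          # calibration activations
teacher = X @ W_keep @ W_prune       # output of the original block group

# Graft a learnable low-rank correction A @ B onto the surviving block,
# then fit it by matching the teacher's output (distillation-style MSE).
A = rng.normal(size=(d, r)) * 0.1
B = rng.normal(size=(r, d)) * 0.1

def loss():
    err = X @ (W_keep + A @ B) - teacher
    return float(np.mean(err ** 2))

init_loss, lr = loss(), 0.05
for _ in range(400):
    err = X @ (W_keep + A @ B) - teacher
    g = X.T @ err * (2.0 / err.size)  # gradient of the MSE wrt the fused weight
    A -= lr * g @ B.T                 # chain rule through the A @ B factorization
    B -= lr * A.T @ g
final_loss = loss()
# final_loss falls below init_loss: part of the pruned block's mapping
# has been absorbed into its neighbor via the low-rank graft.
```

The rank `r`, learning rate, and step count are illustrative choices, not values from the paper; the point is only that a low-rank update to a surviving block can recover much of a removed block's function when trained against the original output.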
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1966