Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing and have become a major research focus, demonstrating strong capabilities in tasks such as mathematical reasoning and story writing. However, their enormous size and compute requirements pose serious challenges for practical deployment: as model size grows, inference speed drops substantially. Pruning is therefore needed to compress and accelerate large models. Existing structured and unstructured pruning methods, however, suffer from compatibility problems and are not fully applicable to large models. Although effective in theory, these methods often exhibit different applicability and effectiveness when applied to complex models. For example, structured pruning may be better suited to compressing a model by sparsifying word-embedding matrices or reducing the number of attention heads, whereas unstructured pruning focuses on removing redundant parameter connections; in practice, both often lack sufficient compatibility and general applicability. We therefore study fusion algorithms for pruning, including fusion acceleration schemes that combine structured and unstructured pruning, as well as schemes that combine pruning with other acceleration methods.
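As a minimal illustration of the distinction the abstract draws (not a method from this work), the sketch below contrasts the two pruning styles on a toy weight matrix: unstructured pruning zeroes individual small-magnitude weights, while structured pruning removes whole rows (e.g., entire neurons or heads). All shapes and thresholds here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense weight matrix standing in for one layer of a model.
W = rng.normal(size=(8, 8))

# Unstructured pruning: zero out individual weights whose magnitude falls
# below a threshold; the matrix keeps its shape but becomes sparse.
threshold = np.quantile(np.abs(W), 0.5)  # prune the smallest ~50% of weights
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: drop whole rows (e.g., whole neurons/heads) with the
# smallest L2 norm; the matrix stays dense but becomes smaller.
row_norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(row_norms)[len(row_norms) // 2:])  # keep top half
W_structured = W[keep, :]

print(W_unstructured.shape)  # same shape, but sparse
print(W_structured.shape)    # fewer rows, still dense
```

The contrast shows why the two approaches differ in hardware compatibility: the structured result is a smaller dense matrix that standard kernels handle directly, while the unstructured result needs sparse-aware kernels to realize any speedup.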