Keywords: Model Compression; Large Language Models; Explainable ML
Abstract: Large language models (LLMs) demonstrate unprecedented capabilities across diverse applications, yet their extensive parameterization creates substantial computational and memory requirements that hinder practical deployment.
While structured pruning shows promise for LLM compression, existing methods use static masks that cannot adapt to different inputs, limiting performance across diverse tasks.
In this work, we present \textsc{SeAP}, a novel semantic-aware structured pruning framework that adaptively identifies optimal masks from input semantics at the pre-fill stage. Our framework features two key components: (1) an explainability-guided importance estimation scheme that uniquely fuses local and global neuron importance to discover diverse, representative mask patterns from the intrinsic characteristics of the calibration data, and (2) a lightweight router-based module, trained through iterative refinement, that efficiently assigns an optimal mask to each input prompt. Experimental results on LLaMA-2/3, Qwen2, and Phi-2 demonstrate that \textsc{SeAP} outperforms state-of-the-art structured pruning methods across diverse language modeling and commonsense reasoning tasks, achieving competitive performance while reducing memory footprint and inference latency.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7789
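The abstract describes a lightweight router that, at the pre-fill stage, assigns one of several precomputed structured masks to each input prompt. The submission page does not expose implementation details, so the following is only a minimal PyTorch sketch of that routing idea under stated assumptions: the mask bank is filled with random placeholder masks (in the paper these would instead come from the explainability-guided importance estimation over calibration data), and all names (MaskRouter, MaskedMLP, mask_bank, keep_ratio) are hypothetical.

```python
# Minimal illustrative sketch (not the authors' code): input-adaptive selection of a
# precomputed structured neuron mask at the pre-fill stage.
import torch
import torch.nn as nn


class MaskRouter(nn.Module):
    """Pick one of `num_masks` precomputed neuron masks from a prompt's hidden states."""

    def __init__(self, hidden_dim: int, num_masks: int, intermediate_dim: int,
                 keep_ratio: float = 0.5):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_masks)
        # Placeholder mask bank: random binary masks keeping `keep_ratio` of neurons.
        # In practice these masks would be derived from fused local/global importance scores.
        keep = int(keep_ratio * intermediate_dim)
        bank = torch.zeros(num_masks, intermediate_dim)
        for k in range(num_masks):
            idx = torch.randperm(intermediate_dim)[:keep]
            bank[k, idx] = 1.0
        self.register_buffer("mask_bank", bank)

    def forward(self, prompt_hidden: torch.Tensor) -> torch.Tensor:
        # prompt_hidden: (batch, seq_len, hidden_dim) pre-fill hidden states.
        pooled = prompt_hidden.mean(dim=1)          # (batch, hidden_dim)
        logits = self.classifier(pooled)            # (batch, num_masks)
        choice = logits.argmax(dim=-1)              # hard routing at inference
        return self.mask_bank[choice]               # (batch, intermediate_dim)


class MaskedMLP(nn.Module):
    """A feed-forward block whose intermediate neurons are gated by the routed mask."""

    def __init__(self, hidden_dim: int, intermediate_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, intermediate_dim)
        self.down = nn.Linear(intermediate_dim, hidden_dim)

    def forward(self, x: torch.Tensor, neuron_mask: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x))                  # (batch, seq_len, intermediate_dim)
        h = h * neuron_mask.unsqueeze(1)            # zero out pruned neurons
        return self.down(h)


if __name__ == "__main__":
    B, T, D, I = 2, 16, 64, 256
    router = MaskRouter(D, num_masks=4, intermediate_dim=I)
    mlp = MaskedMLP(D, I)
    prompt = torch.randn(B, T, D)
    mask = router(prompt)       # chosen once per prompt at pre-fill
    out = mlp(prompt, mask)     # applied throughout decoding
    print(out.shape)            # torch.Size([2, 16, 64])
```

In this sketch the routing is a hard argmax at inference so that only one mask is ever applied per prompt; during router training, a softmax over the logits could be used instead to keep the selection differentiable.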