Abstract: With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model, and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task. To identify and activate effective parameters, we jointly optimize the sparse mask predictor and the LLM, leveraging both instruction-following data and the pre-training corpus. Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model.
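To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a predictor reads a pooled instruction representation and emits a sparse mask over FFN hidden units, which gates the dense model's parameters for that input. All module names, dimensions, the keep ratio, and the top-k plus straight-through masking rule are illustrative assumptions, not the paper's exact architecture or training recipe.

```python
import torch
import torch.nn as nn


class MaskPredictor(nn.Module):
    """Maps an instruction embedding to a sparse 0/1 mask over FFN hidden units."""

    def __init__(self, instr_dim: int, ffn_hidden: int, keep_ratio: float = 0.33):
        super().__init__()
        self.scorer = nn.Linear(instr_dim, ffn_hidden)
        self.keep = max(1, int(keep_ratio * ffn_hidden))

    def forward(self, instr_emb: torch.Tensor) -> torch.Tensor:
        scores = self.scorer(instr_emb)                    # (batch, ffn_hidden)
        topk = scores.topk(self.keep, dim=-1).indices
        mask = torch.zeros_like(scores)
        mask.scatter_(-1, topk, 1.0)                       # keep only top-k units
        # Straight-through trick: forward value is the hard mask,
        # gradients flow through the soft scores so the predictor stays trainable.
        probs = scores.softmax(dim=-1)
        return mask + probs - probs.detach()


class MaskedFFN(nn.Module):
    """A standard transformer FFN whose hidden units are gated by the predicted mask."""

    def __init__(self, d_model: int, ffn_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, ffn_hidden)
        self.down = nn.Linear(ffn_hidden, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x)) * mask.unsqueeze(1)     # zero out pruned units
        return self.down(h)


# Usage: the mask depends on the instruction, so different prompts
# (e.g., coding vs. math) activate different subsets of parameters.
predictor = MaskPredictor(instr_dim=512, ffn_hidden=2048)
ffn = MaskedFFN(d_model=512, ffn_hidden=2048)
instr_emb = torch.randn(2, 512)       # pooled instruction representation
tokens = torch.randn(2, 16, 512)      # token hidden states
out = ffn(tokens, predictor(instr_emb))
```

In an actual structured-pruning setup, the masked hidden units could be physically dropped at inference time, so only the selected parameters are loaded and computed for a given instruction.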
Lay Summary: Modern LLMs are powerful but often too large to run efficiently on devices like laptops or phones. One common way to make them smaller is model pruning - removing parts of the model that aren’t essential. But these pruned models are usually fixed: the same parts are removed no matter what the model is asked to do, which poses challenges in real-world inference scenarios, where tasks can vary significantly.
We propose a smarter approach: dynamically decide which parts of the LLM should be used based on the user’s request. For example, if the prompt is about programming, the model activates the parts that are good at code. If it’s a math problem, it chooses different parts. This technique, which we call Instruction-Following Pruning (IFPruning), saves memory and speeds up the model without sacrificing much performance.
IFPruning achieves strong performance, often beating models of the same reduced size, and stays close to the original full-size model. It also leads to interpretable patterns, where similar tasks activate similar parts of the model — helping us better understand what the model uses to think. Finally, it brings significant speedups, cutting generation time by over 40% while keeping performance high.
Primary Area: Deep Learning->Large Language Models
Keywords: Large language model, model pruning, contextual sparsity, pre-training, fine-tuning
Submission Number: 13949