Keywords: Large Language Models, Adaptive Pruning, Gradient-steered Search
TL;DR: An adaptive LLM pruning framework that uses an encoder-evaluator-decoder approach to optimize pruning for runtime adaptability.
Abstract: Deploying Large Language Models (LLMs) at the edge is crucial for data privacy and offline operation, yet their massive parameter count poses significant resource challenges. While existing methods rely on discrete-space heuristics to search for pruning configurations, we introduce a fundamentally different approach: reformulating the search for optimal LLM pruning configurations as gradient optimization in a learned continuous representation space. Our method, ALPS (Adaptive Layer Pruning via Search), embeds discrete pruning configurations into a continuous space where efficient gradient-based optimization becomes possible, then decodes optimal representations back to implementable discrete pruning schemes. This encoder-evaluator-decoder architecture automatically learns from collected "pruning-score" data pairs, eliminating manual tuning while jointly optimizing for model performance, latency, and energy consumption in a deployment-specific manner. Extensive experiments across Llama-7B, Llama2-7B, Llama2-13B, and Vicuna-7B demonstrate ALPS's superiority, achieving up to 34.1% energy reduction and 33.5% lower latency while maintaining over 91% of original performance. At high pruning ratios (50%), ALPS consistently outperforms state-of-the-art methods in both perplexity and downstream task accuracy.
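To make the encoder-evaluator-decoder idea concrete, below is a minimal sketch (not the authors' code) of such a pipeline: an encoder maps a per-layer pruning-ratio vector to a latent, a surrogate evaluator predicts a scalar "pruning score" from that latent, and a decoder maps an optimized latent back to a pruning configuration. All module names, dimensions, and loss terms here are illustrative assumptions, since the abstract only describes the approach at a high level.

```python
# Illustrative sketch of an encoder-evaluator-decoder pruning search.
# Assumptions: a configuration is a vector of per-layer pruning ratios,
# and "score" is a scalar combining perplexity, latency, and energy
# (lower is better). Sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

NUM_LAYERS, LATENT_DIM = 32, 64  # e.g. a 32-layer LLM; latent size is a guess

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NUM_LAYERS, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))
    def forward(self, cfg):              # cfg: (B, NUM_LAYERS) ratios in [0, 1]
        return self.net(cfg)

class Evaluator(nn.Module):              # surrogate predicting the pruning score
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

class Decoder(nn.Module):                # latent -> implementable configuration
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, NUM_LAYERS), nn.Sigmoid())
    def forward(self, z):
        return self.net(z)

enc, ev, dec = Encoder(), Evaluator(), Decoder()

def train_step(cfgs, scores, opt):
    """One step on collected (configuration, score) pairs:
    reconstruction loss + score-regression loss."""
    z = enc(cfgs)
    loss = (nn.functional.mse_loss(dec(z), cfgs)
            + nn.functional.mse_loss(ev(z), scores))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def search(init_cfg, steps=200, lr=1e-2):
    """Gradient-steered search in the learned latent space, then decode."""
    z = enc(init_cfg).detach().requires_grad_(True)
    z_opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        score = ev(z).sum()              # minimize the predicted score
        z_opt.zero_grad(); score.backward(); z_opt.step()
    return dec(z).detach()               # continuous ratios; quantize to deploy

# Usage sketch: train on gathered data, then search from a seed configuration.
# opt = torch.optim.Adam([*enc.parameters(), *ev.parameters(), *dec.parameters()], lr=1e-3)
# train_step(cfg_batch, score_batch, opt); best_cfg = search(seed_cfg)
```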
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23899