STADE: Standard Deviation as a Pruning Metric

17 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Pruning, Machine Learning, Large Language Model, Deep Learning
TL;DR: We present a mathematical derivation of the pruning problem that recovers a previous method (Wanda), and we extend it to a more general setting by proposing a new pruning method (STADE).
Abstract: Large Language Models (LLMs) have become widespread and are used to solve a wide variety of tasks. To handle many of these tasks successfully, LLMs require longer training times and larger model sizes, which makes them ideal candidates for pruning methods that reduce computational demands while maintaining performance. Many previous methods require a retraining phase after pruning to preserve the original model's performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda's work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis reveals cases where Wanda is no longer optimal. To tackle those cases, we develop a new method, STADE, based on the standard deviation of the input. From a theoretical and empirical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Qwen, Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that, depending on the training conditions, Wanda's optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications.
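To make the contrast concrete, here is a minimal sketch of magnitude-times-activation pruning in the style the abstract describes. The Wanda score follows the published formula (|W_ij| scaled by the L2 norm of input feature j over calibration data); the STADE-style score shown here, scaling by the standard deviation of each input feature, is an assumption based only on the abstract's description, and the paper's exact formula may differ.

```python
import numpy as np

def wanda_scores(W, X):
    """Wanda pruning metric: |W_ij| * ||X_j||_2.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations.
    """
    return np.abs(W) * np.linalg.norm(X, axis=0)

def stade_scores(W, X):
    """Hypothetical STADE-style metric: |W_ij| * std(X_j).

    Assumption for illustration: the abstract only says STADE is
    "based on the standard deviation of the input".
    """
    return np.abs(W) * np.std(X, axis=0)

def prune_by_score(W, scores, sparsity=0.5):
    """Zero out the `sparsity` fraction of lowest-scoring weights per
    output row (the per-output comparison group Wanda uses)."""
    W_pruned = W.copy()
    k = int(W.shape[1] * sparsity)
    lowest = np.argsort(scores, axis=1)[:, :k]  # indices of smallest scores
    np.put_along_axis(W_pruned, lowest, 0.0, axis=1)
    return W_pruned
```

When the input features have zero mean and equal norms, the two scores rank weights identically; they diverge when feature means or variances differ, which is the regime where the paper argues Wanda stops being optimal.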
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 9118