Keywords: Compression, Pruning, Transformer, Efficiency
Abstract: Transformers have become ubiquitous across vision and language tasks, but their depth and parameter count often far exceed what is needed for a given downstream application, leading to unnecessary compute and memory overhead. Existing layer-pruning techniques either require multiple retraining cycles, rely on continuous relaxations that never fully deactivate blocks, or depend on architecture-specific analyses. We introduce STCP, a model-agnostic, single-pass pruning framework that learns binary gates over each block’s multi-head self-attention (MHSA) and MLP sub-layers in a pretrained transformer. During gate optimization we inject noise and apply an $L_1$ penalty, which helps the search escape local minima and recover sparser circuits. We validate STCP on both image classification and NLP tasks with large pretrained models, showing favorable trade-offs between computational complexity and task performance. The code will be made publicly available upon acceptance of the article.
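The gate-learning idea in the abstract can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in (the `importance` scores play the role of the real fine-tuning loss, and the exact STCP relaxation, noise schedule, and thresholding are not specified in the abstract): each sub-layer gets a logit whose sigmoid acts as a soft gate, noise is injected into the logit when computing the gradient, and an $L_1$ penalty on the gate pushes unimportant sub-layers toward zero.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy "task loss": turning a sub-layer off costs `importance` per sub-layer.
# These values are illustrative, not from the paper.
importance = [0.9, 0.05, 0.02, 0.8]  # how much each sub-layer matters
theta = [0.0] * len(importance)      # gate logits, one per sub-layer
lam, lr, sigma = 0.1, 0.5, 0.3       # L1 weight, step size, noise scale

for step in range(500):
    for i in range(len(theta)):
        # Noise injection on the logit (helps escape local minima).
        noisy = theta[i] + random.gauss(0.0, sigma)
        g = sigmoid(noisy)
        # Gradient of importance[i]*(1-g) + lam*g w.r.t. the logit;
        # the L1 term pushes g toward 0 unless the task loss resists.
        grad = (lam - importance[i]) * g * (1.0 - g)
        theta[i] -= lr * grad

# Binarize: keep a sub-layer only if its learned gate exceeds 0.5.
gates = [1 if sigmoid(t) > 0.5 else 0 for t in theta]
print(gates)  # sub-layers whose importance outweighs the penalty stay on
```

In this toy setting, sub-layers whose importance exceeds the penalty weight `lam` keep their gate open, while the others are pruned, mirroring the sparser-circuit behavior the abstract describes.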
Supplementary Material: zip
Primary Area: optimization
Submission Number: 12362