Keywords: Compression, Pruning, Transformer, Efficiency
Abstract: Transformers have become ubiquitous across vision and language tasks, but their depth and parameter count often far exceed what is needed for a given downstream application, leading to unnecessary compute and memory overhead. Existing layer-pruning techniques either require multiple retraining cycles, rely on continuous relaxations that never fully deactivate blocks, or depend on architecture-specific analyses. We introduce STCP, a model-agnostic, single-pass pruning framework that learns binary gates over each block’s multi-head self-attention (MHSA) and MLP sub-layers in a pretrained transformer. During gate optimization we inject noise and apply an $L_1$ penalty, which helps the search escape local minima and recover sparser circuits. We validate STCP on both image classification and NLP tasks with large pretrained models, showing favorable trade-offs between computational complexity and task performance. The code will be made publicly available upon acceptance of the article.
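The gate-learning idea in the abstract can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in (the `importance` scores play the role of the real fine-tuning loss, and the exact STCP relaxation, noise schedule, and thresholding are not specified in the abstract): each sub-layer gets a logit whose sigmoid acts as a soft gate, noise is injected into the logit when computing the gradient, and an $L_1$ penalty on the gate pushes unimportant sub-layers toward zero.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy "task loss": turning a sub-layer off costs `importance` per sub-layer.
# These values are illustrative, not from the paper.
importance = [0.9, 0.05, 0.02, 0.8]  # how much each sub-layer matters
theta = [0.0] * len(importance)      # gate logits, one per sub-layer
lam, lr, sigma = 0.1, 0.5, 0.3       # L1 weight, step size, noise scale

for step in range(500):
    for i in range(len(theta)):
        # Noise injection on the logit (helps escape local minima).
        noisy = theta[i] + random.gauss(0.0, sigma)
        g = sigmoid(noisy)
        # Gradient of importance[i]*(1-g) + lam*g w.r.t. the logit;
        # the L1 term pushes g toward 0 unless the task loss resists.
        grad = (lam - importance[i]) * g * (1.0 - g)
        theta[i] -= lr * grad

# Binarize: keep a sub-layer only if its learned gate exceeds 0.5.
gates = [1 if sigmoid(t) > 0.5 else 0 for t in theta]
print(gates)  # sub-layers whose importance outweighs the penalty stay on
```

In this toy setting, sub-layers whose importance exceeds the penalty weight `lam` keep their gate open, while the others are pruned, mirroring the sparser-circuit behavior the abstract describes.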
Supplementary Material: zip
Primary Area: optimization
Submission Number: 12362