NViT: Vision Transformer Compression and Parameter Redistribution

29 Sept 2021 (modified: 22 Oct 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: Vision transformer, structural pruning, latency-aware, novel architecture
Abstract: Transformers yield state-of-the-art results across many tasks, but they still impose large computational costs during inference. We apply global structural pruning with latency-aware regularization to all parameters of the Vision Transformer (ViT) model to reduce latency. Furthermore, we analyze the pruned architectures and find interesting regularities in the final weight structure. These insights lead to a new architecture, NViT (Novel ViT), which redistributes where parameters are used. NViT utilizes parameters more efficiently and enables control of the latency-accuracy trade-off. On ImageNet-1K, we prune the DeiT-Base (Touvron et al., 2021) model to a 2.6$\times$ FLOPs reduction, 5.1$\times$ parameter reduction, and 1.9$\times$ run-time speedup with only a 0.07% loss in accuracy. We achieve more than 1% accuracy gain when compressing the base model to the throughput of the Small/Tiny variants, and NViT gains 0.1-1.1% accuracy over the hand-designed DeiT family when trained from scratch, while being faster.
One-sentence Summary: This work applies global structural pruning to ViT models and discovers novel efficient architectures
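
To illustrate the pruning criterion the abstract describes, here is a minimal sketch that scores prunable structures with a first-order Taylor importance and rescales each score by the latency that removing the structure would save, so that globally ranked pruning favors structures that buy speedup. This is a sketch under assumed details: the toy linear layer, the per-neuron `latency_saving` table, and the prune fraction are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of latency-aware structural pruning scoring,
# assuming a first-order Taylor importance criterion.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 8)                      # toy prunable layer
x, y = torch.randn(4, 16), torch.randn(4, 8)
loss = nn.functional.mse_loss(layer(x), y)
loss.backward()                               # populates layer.weight.grad

# Importance of each output neuron: (w * dL/dw)^2 summed over its row.
importance = ((layer.weight * layer.weight.grad) ** 2).sum(dim=1)

# Latency-aware rescaling: divide importance by the latency saved when
# that neuron is removed (hypothetical per-neuron lookup values), so
# structures that yield little speedup must justify a higher importance.
latency_saving = torch.full((8,), 1e-4)       # assumed: seconds saved per neuron
score = importance / latency_saving

# Globally rank scores (across all layers, in the full method) and
# prune the lowest-scoring structures; here, the 2 weakest neurons.
prune_idx = torch.argsort(score)[:2]
print("pruned neuron indices:", prune_idx.tolist())
```

In the full method such scores would be ranked jointly across all heads, channels, and embedding dimensions of the ViT, rather than within a single layer as in this toy example.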
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:2110.04869/code)