MG-ViT: A Multi-Granularity Method for Compact and Efficient Vision Transformers

Published: 21 Sept 2023, Last Modified: 16 Jan 2024, NeurIPS 2023 poster
Keywords: Efficient AI, Vision Transformer, Image Classification, Multi-Granularity, Three-Way Decisions.
TL;DR: This paper proposes a two-stage multi-granularity method, MG-ViT, to balance ViT's remarkable performance against its huge computational cost.
Abstract: Vision Transformer (ViT) faces obstacles to wide application due to its huge computational cost. Almost all existing studies on compressing ViT split an image at a single granularity, with very little exploration of multi-granularity splitting. As we know, important information often concentrates in a few random regions of an image, which calls for allocating attention to an image at multiple granularities. Enlightened by this, we introduce a multi-granularity strategy to compress ViT, which is simple but effective. We propose a two-stage multi-granularity framework, MG-ViT, to balance ViT's performance and computational cost. In the single-granularity inference stage, an input image is split into a small number of patches for simple inference. If necessary, a multi-granularity inference stage is then triggered, where the important patches are further split into finer-grained patches for subsequent inference. Moreover, prior studies on compression address only classification, while we extend the multi-granularity strategy to hierarchical ViT for downstream tasks such as detection and segmentation. Extensive experiments prove the effectiveness of the multi-granularity strategy. For instance, on ImageNet, without any loss of performance, MG-ViT reduces the FLOPs of LV-ViT-S by 47% and those of DeiT-S by 56%.
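The two-stage procedure described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; the model callables (`coarse_model`, `fine_model`), the confidence threshold, and the keep ratio are all hypothetical names chosen for illustration, under the assumption that the coarse pass returns both class logits and a per-patch importance score.

```python
import torch

def mg_vit_inference(image, coarse_model, fine_model,
                     conf_threshold=0.5, keep_ratio=0.25):
    """Illustrative sketch of two-stage multi-granularity inference.

    Stage 1 classifies from a coarse patch split and exits early when
    confident; Stage 2 re-splits the most important coarse patches into
    finer-grained patches and runs a second pass. All names and
    thresholds here are assumptions, not the paper's actual API.
    """
    # Stage 1: single-granularity inference on a small number of coarse patches.
    logits, patch_scores = coarse_model(image)   # patch_scores: importance per coarse patch
    probs = torch.softmax(logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)

    # Early exit: the coarse pass is already confident enough.
    if confidence.item() >= conf_threshold:
        return prediction

    # Stage 2: select the most important coarse patches ...
    num_keep = max(1, int(keep_ratio * patch_scores.numel()))
    important = patch_scores.topk(num_keep).indices

    # ... split them into finer-grained patches and re-run inference.
    fine_logits = fine_model(image, important)
    return torch.softmax(fine_logits, dim=-1).argmax(dim=-1)
```

In this reading, the early exit is what saves FLOPs on easy images, while the second pass spends extra computation only on the regions the coarse pass deems important.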
Submission Number: 6032