SI-BiViT: Binarizing Vision Transformers with Spatial Interaction

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract: Binarized Vision Transformers (BiViTs) aim to enable efficient and lightweight use of Vision Transformers (ViTs) on devices with limited computational resources. However, current approaches to binarizing ViTs cause a substantial performance drop relative to the full-precision model, which hinders practical deployment. Through an empirical study, we reveal that spatial interaction (SI) is a critical factor affecting performance due to the lack of token-level correlation, yet previous work ignores this factor. To this end, we design a ViT binarization approach, dubbed SI-BiViT, that incorporates spatial interaction into the binarization process. Specifically, an SI module is placed alongside the Multi-Layer Perceptron (MLP) module to form a dual-branch structure. This structure not only leverages knowledge from pre-trained ViTs by distilling over the original MLP, but also enhances spatial interaction via the introduced SI module. Correspondingly, we design a decoupled training strategy to train the two branches more effectively. Importantly, SI-BiViT is orthogonal to existing binarized ViT approaches and can be directly plugged in. Extensive experiments demonstrate the strong flexibility and effectiveness of SI-BiViT by plugging our method into four classic ViT backbones across three downstream tasks: classification, detection, and segmentation. In particular, SI-BiViT improves the classification performance of binarized ViTs by an average of 10.52% in Top-1 accuracy compared to the previous state-of-the-art. The code will be made publicly available.
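To make the dual-branch idea concrete, below is a minimal, hypothetical PyTorch sketch of how a binarized MLP branch and a spatial-interaction branch could be combined within a transformer block. The module names, the sign-based binarization, and the depthwise-convolution stand-in for the SI module are illustrative assumptions, not the paper's actual implementation; in the actual method the MLP branch is additionally supervised by distillation from the pre-trained full-precision MLP and the two branches are trained with a decoupled strategy.

```python
# Hypothetical sketch of a dual-branch binarized block (NOT the paper's code).
# Assumptions: sign-binarized linear weights with a straight-through estimator,
# and a depthwise 3x3 convolution over the token grid as a stand-in SI module.
import torch
import torch.nn as nn


class BinaryLinear(nn.Linear):
    """Linear layer with sign-binarized weights (straight-through estimator)."""

    def forward(self, x):
        w_bin = torch.sign(self.weight)
        # Binary weights in the forward pass, full-precision gradients backward.
        w = self.weight + (w_bin - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)


class DualBranchMLP(nn.Module):
    """Binarized MLP branch plus a spatial-interaction (SI) branch.

    The SI branch here is a depthwise convolution over the patch-token grid,
    used purely as a placeholder for token-level interaction.
    """

    def __init__(self, dim, hidden_dim, grid_size):
        super().__init__()
        self.grid_size = grid_size  # e.g. 14 for 196 patch tokens (no CLS token)
        self.mlp = nn.Sequential(
            BinaryLinear(dim, hidden_dim), nn.GELU(), BinaryLinear(hidden_dim, dim)
        )
        self.si = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):  # x: (B, N, C) with N == grid_size ** 2
        mlp_out = self.mlp(x)
        B, N, C = x.shape
        grid = x.transpose(1, 2).reshape(B, C, self.grid_size, self.grid_size)
        si_out = self.si(grid).flatten(2).transpose(1, 2)
        return mlp_out + si_out  # fuse the two branches
```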
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Systems] Systems and Middleware, [Content] Vision and Language
Relevance To Conference: In recent years, transformer-based models have developed rapidly, achieving significant success across language, vision, and multimodal tasks. However, their growing parameter counts and computational requirements strain hardware resources. In particular, mobile and embedded devices such as smartphones may lack the computational performance and storage capacity these models demand, impeding deployment at the edge. This paper studies the binarization of transformer models, a technique that substantially reduces their resource overhead and thus facilitates practical deployment. As AI applications are increasingly deployed on smartphones and other mobile devices, vision and multimodal models will become more accessible in daily life. Previous ACM MM papers, such as "Towards Accurate Post-Training Quantization for Vision Transformer," "VQ-DcTr: Vector-Quantized Autoencoder With Dual-channel Transformer Points Splitting for 3D Point Cloud Completion," and "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer," have also explored model compression and quantization techniques for transformer models.
Supplementary Material: zip
Submission Number: 696