ICViT: Integrated Framework for Complexity Reduction in Vision Transformer

Published: 2024 · Last Modified: 16 Nov 2025 · MAPR 2024 · CC BY-SA 4.0
Abstract: In recent years, the Vision Transformer (ViT) has emerged as a new trend in computer vision, attracting considerable attention from the research community. However, its impressive performance comes with high computational complexity, attributed to both the attention mechanism and the multi-layer perceptrons (MLPs), leading to resource-intensive training and inference. Numerous prior works have proposed strategies to alleviate the complexity of ViT, many of which decompose the self-attention mechanism to improve efficiency. Building upon this foundation, we propose a novel architectural framework called ICViT, which integrates several of these decomposition strategies to simultaneously improve performance and reduce the computational complexity of ViT. By leveraging techniques such as Hydra Attention, Linear Angular Attention, SimA Attention, and Class Attention, our approach demonstrates notable advances over prior methodologies and the foundational ViT framework. For instance, ICViT surpasses the original ViT by 9.7% in accuracy on CIFAR-10 and by 8.57% on CIFAR-100, while using 24.4% fewer parameters. We also integrate Local Patch Interaction (LPI) into ICViT to help it learn local communication among patches. Our findings underscore the efficacy of this multifaceted integration, offering promising avenues for future advances in transformer-based architectures.
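To illustrate the kind of attention decomposition ICViT builds on, below is a minimal PyTorch sketch of Hydra Attention, one of the four mechanisms named in the abstract. Hydra Attention replaces softmax attention with a cosine-similarity kernel and as many heads as feature dimensions, which lets the attention factorize into a single global aggregation and reduces complexity from O(N²d) to O(Nd) in the token count N. The module, tensor shapes, and projection layout here are illustrative assumptions, not the authors' ICViT implementation.

```python
# Hydra Attention sketch (assumption: standard ViT token layout [batch, tokens, dim]).
# With a cosine kernel and heads == dim, attention factorizes as
#   out = q_hat * sum_t(k_hat_t * v_t),
# which is linear in the number of tokens N.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HydraAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)   # joint Q, K, V projection
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, tokens, dim]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q, dim=-1)           # cosine kernel: unit-normalize each token
        k = F.normalize(k, dim=-1)
        # Global mixing: aggregate (k * v) over all tokens once, then gate with q.
        kv = (k * v).sum(dim=1, keepdim=True)  # [batch, 1, dim] -- O(N*d)
        out = q * kv                            # [batch, tokens, dim]
        return self.proj(out)

# Usage sketch: a drop-in replacement for a ViT block's softmax attention.
x = torch.randn(2, 197, 384)                # e.g. ViT-S: 196 patches + CLS token
print(HydraAttention(384)(x).shape)         # torch.Size([2, 197, 384])
```

The other components named above follow the same spirit: Linear Angular Attention and SimA likewise avoid the quadratic softmax map, while Class Attention restricts attention to the class token and LPI restores local patch communication that global decompositions can lose.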