The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers

Published: 01 Jan 2024 · Last Modified: 02 Oct 2024 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Transformers have demonstrated exceptional performance on a variety of vision tasks. However, their high computational complexity can become problematic. In this paper, we conduct a systematic analysis of the complexity of each component in vision transformers and identify an easily overlooked detail: the Feed-Forward Network (FFN), rather than the Multi-Head Self-Attention (MHSA) mechanism, is the primary computational bottleneck. Motivated by this observation, we propose a lightweight FFN module, named SparseFFN, that reduces dense computation in both the channel and spatial dimensions. Specifically, SparseFFN consists of two components, Channel-Sparse FFN (CS-FFN) and Spatial-Sparse FFN (SS-FFN), which can be seamlessly incorporated into various vision transformers and even pure MLP models with significantly fewer FLOPs. Extensive experiments demonstrate the effectiveness and efficiency of the proposed method. For example, our approach reduces model complexity by 23%-39% for most vision transformers and MLP models while maintaining comparable accuracy.
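The following is a minimal sketch of how an FFN might be sparsified along the channel and spatial dimensions, in the spirit of the abstract. The module names CS_FFN and SS_FFN follow the abstract, but their internals here (a reduced hidden expansion ratio for channel sparsity, and strided token subsampling for spatial sparsity) are illustrative assumptions, not the paper's actual design.

```python
# Hedged sketch: channel-sparse and spatial-sparse FFN variants.
# The specific mechanisms (smaller expansion ratio, strided token selection)
# are assumptions for illustration only.
import torch
import torch.nn as nn


class CS_FFN(nn.Module):
    """Channel-sparse FFN: shrink the hidden expansion ratio to cut FLOPs
    (a standard ViT FFN uses an expansion ratio of 4)."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):  # x: (B, N, C)
        return self.fc2(self.act(self.fc1(x)))


class SS_FFN(nn.Module):
    """Spatial-sparse FFN: apply the FFN only to a strided subset of tokens,
    leaving the remaining tokens unchanged (illustrative scheme)."""
    def __init__(self, dim, expansion=4, stride=2):
        super().__init__()
        self.stride = stride
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):  # x: (B, N, C)
        y = x[:, ::self.stride]                      # process every `stride`-th token
        y = self.fc2(self.act(self.fc1(y)))
        out = x.clone()
        out[:, ::self.stride] = out[:, ::self.stride] + y   # residual update on selected tokens
        return out


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)   # (batch, tokens, channels)
    print(CS_FFN(384)(x).shape)    # torch.Size([2, 196, 384])
    print(SS_FFN(384)(x).shape)    # torch.Size([2, 196, 384])
```

In this sketch, CS_FFN halves the hidden width (FLOPs scale linearly with the expansion ratio), while SS_FFN applies the full-width FFN to half the tokens; either change alone roughly halves the FFN cost, which is consistent with the 23%-39% whole-model reductions the abstract reports.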