Inception MLP: A vision MLP backbone for multi-scale feature extraction

Published: 01 Jan 2025, Last Modified: 26 Jul 2025, Inf. Sci. 2025, CC BY-SA 4.0
Abstract: Recently, MLP-based networks have demonstrated remarkable performance in computer vision with simple but efficient structures. However, most existing MLP architectures struggle to balance the modeling of local and global information, and they often rely on static token-mixing matrices for information fusion, disregarding the distinctiveness of different input contents. In this study, we propose Inception MLP (iMLP), which employs multiple cross-MLP branches with varying receptive field sizes to capture short-range and long-range dependencies simultaneously. The channel partition ratio γ is adjusted as the network deepens to better align with model characteristics. In addition, considering the diversity of input contents, we incorporate a lightweight, content-adaptive module to enable dynamic and efficient feature fusion. Experimental results demonstrate the versatility of iMLP as a competitive vision backbone across various visual tasks. For instance, our iMLP-S achieves 82.1% top-1 accuracy on the ImageNet-1K classification benchmark with only 20M parameters and extremely high throughput, outperforming state-of-the-art MLP-based models with a better trade-off between accuracy and computational efficiency.
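As a rough illustration of two ideas named in the abstract, the sketch below splits a channel count between a short-range and a long-range branch using a ratio γ, and fuses branch outputs with input-dependent weights. The abstract does not specify any of this in detail: the function names, the linear γ schedule and its direction, the two-branch split, and the mean-based gating score are all assumptions made for illustration, not the authors' design.

```python
import math


def partition_channels(num_channels, gamma):
    """Split a channel count into (short_range, long_range) by a ratio gamma in [0, 1].

    Assumption: gamma is the fraction of channels given to the short-range branch.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    short = round(num_channels * gamma)
    return short, num_channels - short


def gamma_schedule(stage, num_stages, start=0.75, end=0.25):
    """Hypothetical linear schedule for gamma across stages.

    The start/end values and the decreasing direction are arbitrary placeholders.
    """
    if num_stages == 1:
        return start
    t = stage / (num_stages - 1)
    return start + t * (end - start)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(branch_outputs):
    """Fuse equally sized branch feature vectors with content-dependent weights.

    Each branch is scored by the mean of its features, a stand-in for a
    learned lightweight gating layer; the scores are softmax-normalized.
    """
    scores = [sum(b) / len(b) for b in branch_outputs]
    weights = softmax(scores)
    dim = len(branch_outputs[0])
    return [sum(w * b[i] for w, b in zip(weights, branch_outputs)) for i in range(dim)]


if __name__ == "__main__":
    # Channel split per stage for a hypothetical 4-stage network with 64 channels.
    for stage in range(4):
        g = gamma_schedule(stage, 4)
        print(stage, g, partition_channels(64, g))
    # Fusion weights depend on the branch contents, not on a fixed matrix.
    print(adaptive_fuse([[0.2, 0.4], [1.0, 3.0]]))
```

The point of the fusion step is that the weights are computed from the features themselves, so two different inputs are mixed differently, in contrast to a static token-mixing matrix.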