Keywords: Convolutional Neural Network, Vision Backbone, Lightweight, Fast
Abstract: Traditional Convolutional Neural Networks (CNNs) tend to use small $3\times 3$ kernels, which can only capture limited neighboring spatial information.
Inspired by the success of Vision Transformers (ViTs) in capturing long-range visual dependencies, recent CNNs have reached a consensus on employing large-kernel convolutions (e.g., an astonishing $111\times 111$ kernel).
Nevertheless, these approaches are hardware-unfriendly, imposing a heavy computational burden on both training and inference.
This paper introduces a Simple and Fast Convolutional Neural Network (SFCNN) that employs a sequence of stacked $3\times 3$ convolutions but surpasses state-of-the-art CNNs with larger kernels.
In particular, we build a thin and deep model, which stacks more $3\times 3$ convolutions to capture more spatial information under a limited computational budget, rather than opting for a heavier and shallower architecture.
To further enlarge the receptive field, we redesign the traditional inverted residual bottleneck with two $3\times 3$ depthwise convolutions.
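As a rough sketch of such a block: the classic inverted residual bottleneck uses a single depthwise convolution between two pointwise (1×1) convolutions, and placing a second $3\times 3$ depthwise convolution in the block enlarges the receptive field at negligible cost. The exact ordering, expansion ratio, and normalization/activation details below are assumptions for illustration, not the paper's specification:

```python
import numpy as np

def depthwise_conv3x3(x, w):
    # x: (C, H, W) feature map; w: (C, 3, 3) per-channel kernels.
    # Stride 1, zero padding 1, so the spatial size is preserved.
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Each channel is convolved with its own 3x3 kernel.
            out += w[:, i:i + 1, j:j + 1] * xp[:, i:i + H, j:j + W]
    return out

def pointwise_conv(x, w):
    # 1x1 convolution mixing channels: w has shape (C_out, C_in).
    return np.einsum('oc,chw->ohw', w, x)

def inverted_residual_2dw(x, w_dw1, w_up, w_down, w_dw2):
    # Hypothetical block with TWO 3x3 depthwise convolutions
    # bracketing the pointwise expand/project pair.
    y = depthwise_conv3x3(x, w_dw1)   # first depthwise 3x3
    y = pointwise_conv(y, w_up)       # expand channels
    y = np.maximum(y, 0.0)            # placeholder activation (ReLU)
    y = pointwise_conv(y, w_down)     # project channels back
    y = depthwise_conv3x3(y, w_dw2)   # second depthwise 3x3 enlarges RF
    return x + y                      # residual connection
```

With zero-initialized weights the branch is an identity map through the residual connection, which is a common sanity check for such blocks.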
In addition, we propose a novel Global Sigmoid Linear Unit (GSiLU) activation function to capture global coarse-grained spatial information.
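Reading the name literally, a GSiLU-style gate can be interpreted as SiLU ($x \cdot \sigma(x)$) with the sigmoid computed from globally pooled features, so every position is modulated by coarse global context. This is a minimal sketch under that assumption; the paper's exact formulation may differ:

```python
import numpy as np

def gsilu(x):
    # x: (N, C, H, W) feature map.
    # Global average pooling over the spatial dims -> (N, C, 1, 1).
    pooled = x.mean(axis=(2, 3), keepdims=True)
    # Gate each channel by the sigmoid of its global mean,
    # injecting coarse-grained global spatial information.
    return x * (1.0 / (1.0 + np.exp(-pooled)))
```

For a constant input the gate reduces to a plain sigmoid scaling, e.g. an all-ones map is scaled by $\sigma(1)\approx 0.731$.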
Our SFCNN performs better than state-of-the-art CNNs and ViTs on various tasks, including ImageNet-1K image classification, COCO instance segmentation, and ADE20K semantic segmentation.
It also has good scalability and outperforms existing state-of-the-art lightweight models.
All materials, including code and logs, are provided in the supplementary material.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6406