Convolution on Your 12× Wide Feature: A ConvNet with Nested Design

20 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: convolution on 12× wide high-dimensional feature, pure ConvNet with nested design, vision backbone
TL;DR: Amid the wave of modern ConvNets adopting ViT-style designs, a successful exploration and innovation of the block architecture for ConvNets.
Abstract: The Transformer stands as the preferred architecture for handling multimodal data under resource-abundant conditions. In resource-constrained unimodal vision tasks, however, Convolutional Neural Networks (ConvNets), especially smaller-scale ones, offer a hardware-friendly solution thanks to the highly optimized acceleration and deployment schemes tailored for convolution operators. Modern de facto ConvNets adopt a ViT-style block-level design, i.e., a sequential design with a token mixer and an MLP. However, this design choice seems more influenced by the prominence of the Transformer in multimodal domains than by an inherent suitability within ConvNets. In this work, we suggest allocating a larger proportion of computational resources to spatial convolution layers, and further summarize three guidelines to steer such ConvNet design. Specifically, we observe that convolution on 12× wide high-dimensional features helps expand the receptive field and capture rich spatial information, and accordingly devise a ConvNet model with a nested design, dubbed ConvNeSt. ConvNeSt outperforms ConvNeXt on ImageNet classification, COCO detection, and ADE20K segmentation across different model variants, demonstrating the feasibility of revisiting ConvNet block design. As a small-scale student model, ConvNeSt also achieves stronger performance than ConvNeXt through knowledge distillation.
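The nested design described in the abstract can be sketched as follows: a 1×1 convolution expands the channels by the 12× ratio, the spatial (depthwise) convolution then runs on that wide hidden feature, and a second 1×1 convolution projects back. This is a minimal pure-Python illustration of the block structure only; the uniform kernel weights, 3×3 kernel size, helper names, and toy shapes are illustrative assumptions, not the paper's actual implementation.

```python
def depthwise_conv3x3(x):
    """3x3 depthwise convolution with zero padding.
    x: [C][H][W] nested lists; a uniform 3x3 kernel per channel
    (placeholder weights, just to keep the sketch self-contained)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                s = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:
                            s += x[c][ii][jj] / 9.0  # uniform kernel weight
                out[c][i][j] = s
    return out

def pointwise(x, out_channels):
    """1x1 convolution (channel mixing) with uniform averaging weights."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(x[c][i][j] for c in range(C)) / C
              for j in range(W)] for i in range(H)]
            for _ in range(out_channels)]

def nested_block(x, ratio=12):
    """Nested design: expand channels by `ratio`, run the spatial
    depthwise conv on the wide feature, then project back."""
    C = len(x)
    h = pointwise(x, C * ratio)   # 1x1 expand to 12x width
    h = depthwise_conv3x3(h)      # spatial mixing on the wide feature
    return pointwise(h, C)        # 1x1 project back to C channels

# Toy input: 2 channels, 4x4 spatial grid
x = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
y = nested_block(x)
print(len(y), len(y[0]), len(y[0][0]))  # shape preserved: 2 4 4
```

The contrast with the ViT-style sequential block is where the spatial operator sits: there, the token mixer acts on the narrow C-dimensional feature before the MLP; here, it acts inside the expansion, on the C·ratio-dimensional feature.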
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2398