Caterpillar: A Pure-MLP Architecture with Shifted-Pillars-Concatenation

Published: 20 Jul 2024, Last Modified: 04 Aug 2024
Venue: MM2024 Poster
License: CC BY 4.0
Abstract: Modeling in computer vision has evolved to MLPs. Vision MLPs inherently lack local modeling capability, and the simplest remedy is to combine them with convolutional layers. Convolution, known for its sliding-window scheme, also suffers from that scheme's redundancy and limited parallelism. In this paper, we seek to dispense with the windowing scheme and introduce a more elaborate and parallelizable method to exploit locality. To this end, we propose a new MLP module, Shifted-Pillars-Concatenation (SPC), which consists of two steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations and concatenation to the maps to aggregate local features. The SPC module offers superior local modeling power and performance gains, making it a promising alternative to the convolutional layer. We then build a pure-MLP architecture called Caterpillar by replacing the convolutional layer with the SPC module in the hybrid model sMLPNet. Extensive experiments show that Caterpillar performs excellently on both small-scale and ImageNet-1k classification benchmarks, and that it also scales and transfers well. The code is available at https://github.com/sunjin19126/Caterpillar.
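The two SPC steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it is an illustration under stated assumptions: a shift step of one pixel, zero padding at the borders, one linear projection per shifted map from C to C/4 channels, and a channel-last (H, W, C) layout. The function names `pillars_shift` and `pillars_concat` are hypothetical.

```python
import numpy as np


def pillars_shift(x, step=1):
    """Return four copies of x (H, W, C), shifted up, down, left and right.

    Assumption: vacated border positions are filled with zeros.
    """
    h, w, _ = x.shape
    # Pad both spatial axes so that every shift is a plain slice.
    p = np.pad(x, ((step, step), (step, step), (0, 0)))
    up = p[2 * step:2 * step + h, step:step + w]     # up[i, j]    = x[i + step, j]
    down = p[0:h, step:step + w]                     # down[i, j]  = x[i - step, j]
    left = p[step:step + h, 2 * step:2 * step + w]   # left[i, j]  = x[i, j + step]
    right = p[step:step + h, 0:w]                    # right[i, j] = x[i, j - step]
    return [up, down, left, right]


def pillars_concat(maps, weights):
    """Project each shifted map with its own matrix, then concatenate channels.

    Assumption: each weight has shape (C, C // 4), so the output has C channels.
    """
    projected = [m @ w for m, w in zip(maps, weights)]
    return np.concatenate(projected, axis=-1)


# Example usage with random data and random (untrained) weights.
rng = np.random.default_rng(0)
height, width, channels = 8, 8, 16
image = rng.standard_normal((height, width, channels))
weights = [rng.standard_normal((channels, channels // 4)) for _ in range(4)]
features = pillars_concat(pillars_shift(image), weights)  # shape (8, 8, 16)
```

In this sketch every output position mixes channels from its four direct neighbors, and all four projections are plain matrix multiplications over the whole map, so no sliding window is involved.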
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Image classification is a crucial task in multimedia processing and plays a significant role in multimedia understanding, indexing, retrieval, analysis, and quality assessment. Accurate classification enables us to identify objects, scenes, and concepts in images, serving as the basis for higher-level multimedia analysis. This study introduces Caterpillar, a powerful tool for image classification. Caterpillar is a pure MLP-based deep architecture that uses MLPs to capture local and global information separately, achieving performance comparable or superior to state-of-the-art methods on both large- and small-scale image classification benchmarks. These results demonstrate Caterpillar's capability to learn complex patterns and extract high-level representations from various multimedia data, making it highly applicable in diverse multimedia applications and services. The core idea of Caterpillar is the Shifted-Pillars-Concatenation (SPC) module, which replaces traditional convolution for more effective aggregation of local information. Building on the advances of convolutional neural networks in multimedia processing, such as visual feature extraction, multimodal fusion, and cross-modal retrieval, we anticipate that integrating the SPC module with efficient techniques, such as depth-wise settings, will reduce computational costs and further improve module performance. Additionally, we look forward to exploring Caterpillar on other tasks, such as detection and segmentation, particularly in data-hungry domains.
Supplementary Material: zip
Submission Number: 455