CT-Mixer: Exploiting Multiscale Design for Local-Global Representations Learning

Published: 01 Jan 2023 · Last Modified: 09 Feb 2025 · ICA3PP (1) 2023 · CC BY-SA 4.0
Abstract: Convolutional neural networks (ConvNets) have been widely used for feature extraction in various computer vision tasks, such as image classification, object detection, and instance segmentation. Recently, Vision Transformers (ViTs) have demonstrated exceptional performance on upstream vision tasks such as image classification, owing to their effectiveness in modeling long-range dependencies. However, ViTs' performance is limited by their weak inductive biases for two-dimensional data and by under-explored multi-scale representation learning, both of which are fundamental strengths of ConvNets. In this paper, we introduce CT-Mixer, a novel architecture that combines convolution-based and transformer-based modules to exploit the advantages of both and compensate for their respective weaknesses. CT-Mixer cross-stacks the convolution-based and transformer-based modules, assigning each module a position in the stacking order so that local information and global context are learned in turn. Additionally, we incorporate a dynamic mechanism into the convolution-based modules to model input-adaptive dependencies. We also improve the multi-scale representation learning strategy by adopting the multi-branch structure of MPViT. Experimental results demonstrate that CT-Mixer achieves performance competitive with existing methods.
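To make the cross-stacking idea concrete, below is a minimal PyTorch sketch of one CT-Mixer-style stage that alternates a convolution-based block (with a simple channel-gating mechanism standing in for the paper's dynamic mechanism) and a transformer-based block. All module names, the gating design, and hyperparameters here are illustrative assumptions, not the authors' released implementation, and the MPViT-style multi-branch stem is omitted for brevity.

```python
# Hypothetical sketch of cross-stacked conv/transformer modules; not the
# authors' code. Block designs and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class DynamicConvBlock(nn.Module):
    """Convolutional block with a squeeze-and-excitation-style gate
    (an assumed stand-in for the paper's dynamic mechanism)."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise: local detail
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, 1),                          # pointwise: channel mixing
        )
        self.gate = nn.Sequential(                           # input-adaptive channel weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        return x + y * self.gate(y)  # residual + adaptive gating


class TransformerBlock(nn.Module):
    """Standard self-attention block over flattened spatial tokens (global context)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, HW, C) token sequence
        n = self.norm1(t)
        t = t + self.attn(n, n, n)[0]     # global self-attention with residual
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)


class CTMixerStage(nn.Module):
    """One stage that cross-stacks local (conv) and global (transformer) modules."""
    def __init__(self, dim: int, depth: int = 2):
        super().__init__()
        blocks = []
        for _ in range(depth):
            blocks += [DynamicConvBlock(dim), TransformerBlock(dim)]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x)


if __name__ == "__main__":
    stage = CTMixerStage(dim=64)
    out = stage(torch.randn(1, 64, 14, 14))
    print(out.shape)  # torch.Size([1, 64, 14, 14])
```

Alternating the two block types, rather than running them in separate towers, lets every transformer block attend over features already refined by a convolution, which is one plausible reading of the "assigned order" described in the abstract.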