Highlights
•We introduce a novel backbone, COLAFormer, built on Concatenating global tokens in Local Attention (COLA). COLA fuses local attention with sparse global attention via concatenation, achieves linear complexity, and transfers well to downstream tasks.
•To provide high-quality global information for COLA, we introduce an effective downsampling module, the learnable condensing feature (LCF) module, which downsamples tokens in a clustering-like manner.
•Extensive experiments across various vision tasks demonstrate that COLAFormer achieves impressive classification accuracy on ImageNet-1K and strikes a balance between performance and resource consumption on downstream tasks with large input resolutions, such as object detection.
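To make the COLA idea concrete, the following is a minimal NumPy sketch of window-local attention whose keys and values are concatenated with a few pooled global tokens. The function name `cola_attention`, the use of simple average pooling as a stand-in for the LCF module, and all shapes are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cola_attention(x, window=4, n_global=2):
    # x: (N, d) token sequence; N must be divisible by window and n_global.
    N, d = x.shape
    # Stand-in for LCF: average-pool the sequence into n_global global tokens.
    g = x.reshape(n_global, N // n_global, d).mean(axis=1)    # (G, d)
    out = np.empty_like(x)
    for s in range(0, N, window):
        q = x[s:s + window]                        # (w, d) queries of one window
        # Concatenate the window's own tokens with the global tokens
        # to form the keys/values each query attends to.
        kv = np.concatenate([x[s:s + window], g])  # (w + G, d)
        attn = softmax(q @ kv.T / np.sqrt(d))      # (w, w + G)
        out[s:s + window] = attn @ kv
    return out
```

Because every query attends to only `window + n_global` tokens, the cost is O(N·(w+G)), i.e. linear in sequence length, which matches the linear-complexity claim above.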