Highlights
•We introduce a novel backbone, COLAFormer, built on Concatenating global tokens in Local Attention (COLA). COLA fuses local attention with sparse global attention via concatenation, achieves linear complexity, and transfers well to downstream tasks.
•To provide high-quality global information for COLA, we introduce an effective downsampling module, the learnable condensing feature (LCF) module, which downsamples tokens in a clustering-like manner.
•Extensive experiments across various vision tasks demonstrate that COLAFormer achieves impressive classification accuracy on ImageNet-1K and strikes a balance between performance and resource consumption on downstream tasks with large input resolutions, such as object detection.
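To make the COLA idea concrete, the following is a minimal NumPy sketch of window-local attention whose keys and values are concatenated with a few pooled global tokens. The function name `cola_attention`, the use of simple average pooling as a stand-in for the LCF module, and all shapes are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cola_attention(x, window=4, n_global=2):
    # x: (N, d) token sequence; N must be divisible by window and n_global.
    N, d = x.shape
    # Stand-in for LCF: average-pool the sequence into n_global global tokens.
    g = x.reshape(n_global, N // n_global, d).mean(axis=1)    # (G, d)
    out = np.empty_like(x)
    for s in range(0, N, window):
        q = x[s:s + window]                        # (w, d) queries of one window
        # Concatenate the window's own tokens with the global tokens
        # to form the keys/values each query attends to.
        kv = np.concatenate([x[s:s + window], g])  # (w + G, d)
        attn = softmax(q @ kv.T / np.sqrt(d))      # (w, w + G)
        out[s:s + window] = attn @ kv
    return out
```

Because every query attends to only `window + n_global` tokens, the cost is O(N·(w+G)), i.e. linear in sequence length, which matches the linear-complexity claim above.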