Abstract: While local-window self-attention performs notably in vision tasks, it suffers from a limited receptive field and weak modeling capability. This is mainly because it performs self-attention within non-overlapping windows and shares weights along the channel dimension. We propose MixFormer to address these issues. First, we combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive field. Second, we propose bi-directional interactions across the two branches to provide complementary clues in the channel and spatial dimensions. These two designs are integrated to achieve efficient feature mixing among windows and dimensions. Our MixFormer achieves competitive results with EfficientNet on image classification and outperforms RegNet and Swin Transformer. In downstream tasks, it outperforms its alternatives by significant margins at lower computational cost across five dense prediction tasks on MS COCO, ADE20K, and LVIS. Code is available at https://github.com/PaddlePaddle/PaddleClas.
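
For intuition, the sketch below illustrates the kind of parallel block the abstract describes: a local-window self-attention branch and a depth-wise convolution branch, coupled by bi-directional channel/spatial interactions. It is a minimal PyTorch approximation; all module names, gating choices, and shapes here are illustrative assumptions, not the authors' implementation (which lives in the linked PaddleClas repository).

```python
# Minimal conceptual sketch (NOT the official MixFormer code) of a parallel
# window-attention + depth-wise convolution block with bi-directional
# interactions, as described in the abstract. Shapes and gating are assumptions.
import torch
import torch.nn as nn


class ParallelMixingBlock(nn.Module):
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        # Branch 1: self-attention inside non-overlapping local windows
        # (weights shared along the channel dimension).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: depth-wise convolution (weights shared across spatial
        # positions), which mixes features across neighbouring windows.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Bi-directional interactions (assumed form): the conv branch supplies
        # a channel gate to the attention branch; the attention branch supplies
        # a spatial gate to the conv branch.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )
        self.spatial_gate = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, 1)

    def window_attention(self, x):
        # Partition the feature map into non-overlapping windows and run
        # self-attention within each window.
        b, c, h, w = x.shape
        ws = self.window_size
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x

    def forward(self, x):
        # x: (batch, channels, height, width), height/width divisible by window_size.
        conv_feat = self.dwconv(x)
        attn_feat = self.window_attention(x)
        # Channel interaction: conv branch -> attention branch.
        attn_feat = attn_feat * self.channel_gate(conv_feat)
        # Spatial interaction: attention branch -> conv branch.
        conv_feat = conv_feat * self.spatial_gate(attn_feat)
        # Concatenate and project to mix the two branches.
        return self.proj(torch.cat([attn_feat, conv_feat], dim=1))


if __name__ == "__main__":
    block = ParallelMixingBlock(dim=64, window_size=7)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```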