Abstract: Transformers display impressive capabilities on vision tasks. However, their built-in self-attention incurs a computational cost that is quadratic in the spatial resolution of the image features. Traditional downsampling (e.g., average pooling) can reduce the resolution, but it may discard detailed information. In this work, we propose Efficient Wavelet Attention (EWA), which incorporates the wavelet transform and a Mean GELU (MGELU) function. First, the wavelet transform lets detailed information participate in efficient interaction modeling. Second, MGELU takes the statistical mean as a reference and loosely passes responses that are high relative to it. Building upon EWA, we present an effective Semantic-aware Wavelet Transformer (SWFormer), which is then employed for pyramid learning, covering the CNN feature hierarchy and Region of Interest (RoI) features. For the feature hierarchy, a Pyramid SWFormer (PSWFormer) incorporates SWFormer at each level to fit the bidirectional features. For RoIs, a Recognition-Localization SWFormer (RLSWFormer) is inserted into the head to fit their features from all levels. The effectiveness of SWFormer is demonstrated experimentally on the MS COCO detection dataset and the Pascal VOC dataset. With a Swin-small backbone, our SWFormer-based method achieves an AP of 52.1 in single-scale evaluation on the COCO test-dev set.
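The contrast between average pooling and wavelet-based downsampling can be illustrated with a minimal sketch. It is not the paper's implementation: the single-level Haar transform is only one possible wavelet, and `mean_gelu` is a hypothetical reading of MGELU (GELU applied to responses after subtracting their mean), since the abstract does not give the exact formula. All function names are illustrative.

```python
import math

def haar_dwt2(img):
    """Single-level 2D Haar transform of an even-sized 2D list.

    Returns four half-resolution subbands: the low-pass average and
    three detail bands (column, row, and diagonal differences).
    """
    h, w = len(img), len(img[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    low, d_col, d_row, d_diag = [], [], [], []
    for i in range(0, h, 2):
        r_low, r_col, r_row, r_diag = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_low.append((a + b + c + d) / 2)   # smooth content
            r_col.append((a - b + c - d) / 2)   # left minus right column
            r_row.append((a + b - c - d) / 2)   # top minus bottom row
            r_diag.append((a - b - c + d) / 2)  # diagonal difference
        low.append(r_low)
        d_col.append(r_col)
        d_row.append(r_row)
        d_diag.append(r_diag)
    return low, d_col, d_row, d_diag

def avg_pool2(img):
    """2x2 average pooling: keeps only a scaled copy of the low-pass band."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4
             for j in range(0, len(img[0]), 2)]
            for i in range(0, len(img), 2)]

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mean_gelu(values):
    """ASSUMPTION: a hypothetical MGELU that gates each response by GELU
    relative to the mean of all responses. The paper may define it differently."""
    mean = sum(values) / len(values)
    return [gelu(v - mean) for v in values]
```

In this sketch both operations halve the spatial resolution, but average pooling keeps only the smooth band, while the Haar transform moves the detail into three extra subbands that remain available to attention. The soft GELU gate lets responses above the mean pass largely unchanged and suppresses those below it without a hard cutoff.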
External IDs:dblp:journals/tmm/LiJLLLC25a