Selective Depth Attention Networks for Adaptive Multiscale Feature Representation

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · IEEE Trans. Artif. Intell. 2024 · CC BY-SA 4.0
Abstract: Existing multiscale methods risk merely enlarging receptive field sizes while neglecting small receptive fields. Constructing adaptive neural networks that recognize objects at various spatial scales therefore remains a challenging problem. To tackle this issue, we first introduce a new attention dimension, i.e., depth, in addition to existing attention dimensions such as channel, spatial, branch, and self-attention. We present a novel selective depth attention network to treat objects of different scales symmetrically in various vision tasks. Specifically, the blocks within each stage of a network, whether a convolutional neural network (CNN) such as ResNet, SENet, or Res2Net, or a vision transformer (ViT) such as PVTv2, output hierarchical feature maps with the same resolution but different receptive field sizes. Based on this structural property, we design a depthwise building module, namely a selective depth attention (SDA) module, comprising a trunk branch and an SE-like attention branch. The block outputs of the trunk branch are fused and passed through the attention branch to globally guide the allocation of attention across depths. Guided by this attention mechanism, we dynamically select features at different depths, adaptively adjusting the receptive field sizes to variable-sized input objects. Moreover, our method is orthogonal to multiscale networks and attention networks, and the combined models are denoted SDA-$x$Net. Extensive experiments demonstrate that, as a lightweight and efficient plug-in, the proposed SDA method significantly improves performance on numerous computer vision tasks, e.g., image classification, object detection, and instance segmentation.
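To make the described mechanism concrete, below is a minimal sketch (PyTorch) of a depth-attention module along the lines of the abstract: block outputs of one stage (same resolution) are fused, squeezed by global average pooling, and an SE-like bottleneck produces one attention weight per block depth, which reweights the stacked block outputs. The class name, the reduction ratio, the sum-based fusion, and the softmax over depths are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class SDAModule(nn.Module):
    """Illustrative depth-attention module (assumed design, not the paper's exact one)."""

    def __init__(self, channels: int, num_blocks: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze spatial dimensions
        self.fc = nn.Sequential(                     # SE-like attention branch
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, num_blocks),
        )

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: outputs of the trunk-branch blocks within one stage,
        # each of shape (B, C, H, W) with identical resolution.
        stacked = torch.stack(block_outputs, dim=1)                # (B, D, C, H, W)
        fused = stacked.sum(dim=1)                                 # fuse across depths
        squeezed = self.pool(fused).flatten(1)                     # (B, C)
        weights = torch.softmax(self.fc(squeezed), dim=1)          # (B, D) depth attention
        return (weights[:, :, None, None, None] * stacked).sum(dim=1)


# Usage: combine the outputs of, e.g., three residual blocks of one stage.
if __name__ == "__main__":
    feats = [torch.randn(2, 64, 56, 56) for _ in range(3)]
    out = SDAModule(channels=64, num_blocks=3)(feats)
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```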