Abstract: Traffic sign detection is essential for intelligent driving. Existing detection algorithms typically incorporate self-attention mechanisms to model the dependencies among image elements, such as patches or pixels. When a patch is used as a token, the positional information within the patch can be lost. When a pixel is used as a token, the number of tokens grows sharply, and the computational complexity increases significantly with it. To balance these two extremes, we note that a pixel only needs to attend to the pixels in its surrounding area. We therefore propose a local attention module termed Pixel-wise Spatial Feature Enhancement (PSFE), which uses pixels as tokens to enhance the spatial information of feature maps and restricts each pixel's self-attention to a local region to reduce computational complexity. Furthermore, we design a Bidirectional Res2Net (BR) module that generates multiple feature maps with different channel numbers from an input feature map and then restores them to a single feature map of the original input size through bidirectional fusion, greatly enriching the receptive-field information contained in the feature map. We evaluate our method on the GTSDB, TT100K, and CCTSDB 2021 datasets, and the experimental results show that it achieves superior performance.
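The idea of restricting each pixel's self-attention to a local region can be illustrated with a minimal NumPy sketch. This is not the PSFE module itself: the function names `local_pixel_attention` and `attention_pairs`, the square window, the zero padding with masking, and the use of the raw features as queries, keys, and values (no learned projections) are all assumptions made for illustration.

```python
import numpy as np


def local_pixel_attention(x, window=3):
    """Illustrative local self-attention with pixels as tokens.

    x: feature map of shape (H, W, C).
    Each pixel attends only to the window x window neighborhood around it.
    Assumption: queries, keys, and values are the raw features themselves.
    """
    h, w, c = x.shape
    r = window // 2
    offsets = [(dy, dx) for dy in range(window) for dx in range(window)]

    # Zero-pad the borders and remember which positions are real pixels.
    padded = np.pad(x, ((r, r), (r, r), (0, 0)))
    valid = np.pad(np.ones((h, w), dtype=bool), r)

    # Gather the K = window**2 neighbors of every pixel: shape (H, W, K, C).
    neigh = np.stack([padded[dy:dy + h, dx:dx + w] for dy, dx in offsets], axis=2)
    mask = np.stack([valid[dy:dy + h, dx:dx + w] for dy, dx in offsets], axis=2)

    # Scaled dot-product scores between each pixel and its neighbors.
    scores = np.einsum("hwc,hwkc->hwk", x, neigh) / np.sqrt(c)
    scores = np.where(mask, scores, -np.inf)  # padded positions get zero weight

    # Numerically stable softmax over the neighborhood axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output pixel is a weighted sum of its neighbors' features.
    out = np.einsum("hwk,hwkc->hwc", weights, neigh)
    return out, weights


def attention_pairs(h, w, window=None):
    """Count query-key pairs: global if window is None, else local."""
    tokens = h * w
    return tokens * tokens if window is None else tokens * window * window


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 8, 4))
    enhanced, attn = local_pixel_attention(feat, window=3)
    print(enhanced.shape, attn.shape)
    print(attention_pairs(64, 64), attention_pairs(64, 64, window=7))
```

Under these assumptions, the number of query-key pairs grows linearly with the number of pixels rather than quadratically: for a 64 × 64 feature map, global pixel-wise attention scores 16,777,216 pairs, whereas a 7 × 7 local window scores 200,704.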