Abstract: The Transformer has shown great promise as an efficient tool for medical image analysis, largely owing to its ability to capture global context. However, its limited ability to capture local information constrains its performance in this domain. To mitigate this issue, we propose HIFNet, a novel medical image segmentation network based on hierarchical attention feature fusion. Specifically, we adopt a pre-trained MaxViT as the encoder. In our newly constructed decoder, spatial attention is applied to feature maps of different sizes to emphasize critical regions of the input images. We further incorporate multiple attention mechanisms, including criss-cross attention, to capture sensitive spatial relationships within medical images. In addition, we employ coordinate attention in the skip connections to embed positional information along different directions, yielding feature maps enriched with position-sensitive cues. Experiments on relevant medical image datasets demonstrate the effectiveness and scalability of the proposed network.
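To give a rough sense of the coordinate-attention idea used in the skip connections, the sketch below gates a feature map with direction-aware attention factors pooled along each spatial axis. This is a minimal NumPy illustration only: the hypothetical per-channel weights `w_h` and `w_w` stand in for the learned transforms of the actual coordinate-attention module, whose exact design is not specified here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, w_h=None, w_w=None):
    """Minimal coordinate-attention-style gate on a (C, H, W) feature map.

    Pools along each spatial axis separately, so each gate keeps
    positional information in one direction, then reweights x with
    both gates. The learned convolutions of the real module are
    replaced by optional per-channel weights (identity if omitted) --
    a sketch under stated assumptions, not the paper's formulation.
    """
    c, h, w = x.shape
    z_h = x.mean(axis=2)                 # (C, H): pool along width
    z_w = x.mean(axis=1)                 # (C, W): pool along height
    if w_h is None:
        w_h = np.ones(c)
    if w_w is None:
        w_w = np.ones(c)
    a_h = sigmoid(z_h * w_h[:, None])    # height-direction gate, (C, H)
    a_w = sigmoid(z_w * w_w[:, None])    # width-direction gate, (C, W)
    # Broadcast both gates over the full map and reweight the input.
    return x * a_h[:, :, None] * a_w[:, None, :]

# Example: the gated map keeps the input's shape.
feat = np.random.rand(4, 8, 8)
out = coordinate_attention(feat)
```

Because the two gates are pooled along orthogonal axes, each output position is modulated by statistics of its entire row and its entire column, which is how this style of attention injects directional positional information into a skip connection.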