D-Net: Dynamic Large Kernel with Dynamic Feature Fusion for Volumetric Medical Image Segmentation

Published: 15 May 2024, Last Modified: 30 Sept 2024, OpenReview Archive Direct Upload, Everyone, CC BY 4.0
Abstract: Hierarchical Vision Transformers (ViT) have achieved significant success in medical image segmentation due to their large receptive field and their ability to leverage long-range contextual information. Convolutional neural networks (CNNs) can also provide a large receptive field by using large convolutional kernels. However, because their kernels are fixed in size, CNNs with large kernels remain limited in their ability to adaptively capture multi-scale features from organs with large variations in shape and size. They are also unable to utilize global contextual information efficiently. To address these limitations, we propose lightweight Dynamic Large Kernel (DLK) and Dynamic Feature Fusion (DFF) modules. The DLK employs multiple large kernels with varying kernel sizes and dilation rates to capture multi-scale features. A dynamic selection mechanism then adaptively highlights the most important channel and spatial features based on global information. The DFF adaptively fuses multi-scale local feature maps based on their global information. We construct DLK-Net for medical image segmentation by incorporating DLK and DFF into a hierarchical ViT architecture. Incorporating large convolutional kernels into hierarchical ViT architectures makes it possible to utilize their scaling behavior, but such models cannot sufficiently extract low-level features because of the feature embedding used in ViT architectures. To tackle this limitation, we propose a Salience layer that extracts low-level features from images at their original dimensions without feature embedding. This Salience layer employs a Channel Mixer to effectively capture global representations. We incorporate DLK, DFF, and the Salience layer into a hierarchical ViT architecture to develop a novel architecture, termed D-Net. D-Net can effectively utilize a multi-scale large receptive field and adaptively harness global contextual information. To further demonstrate the superiority of our DLK, we incorporate it into a widely used hybrid CNN-ViT architecture to build DLK-NETR. We apply the three models (DLK-Net, D-Net, and DLK-NETR) to three volumetric segmentation tasks, and extensive experimental results demonstrate their superior segmentation performance compared to state-of-the-art models, with comparatively lower computational complexity.
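As a rough illustration of how kernel size and dilation rate determine the receptive field that the DLK relies on, the sketch below computes the per-axis span of a dilated kernel and the receptive field of stride-1 convolutions applied in sequence. The specific kernel sizes and dilation rates, the function names, and the assumption that the kernels are applied one after another are all hypothetical and are not stated in the abstract.

```python
# Illustrative only: kernel sizes, dilation rates, and the sequential
# arrangement are assumptions, not values taken from the abstract.

def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in voxels, per axis) covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(layers: list[tuple[int, int]]) -> int:
    """Per-axis receptive field of stride-1 convolutions applied in sequence.

    Each layer is a (kernel_size, dilation) pair.
    """
    field = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by its effective span minus one.
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


# Hypothetical configuration: a 5-wide kernel followed by a 7-wide kernel
# with dilation 3.
layers = [(5, 1), (7, 3)]
print([effective_kernel_size(k, d) for k, d in layers])  # [5, 19]
print(stacked_receptive_field(layers))                   # 23
```

Under these assumed settings, two modest kernels cover 23 voxels per axis, which is the sense in which dilation yields a large receptive field without a correspondingly large dense kernel.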