Abstract: Deformable medical image registration is a crucial task that involves extracting and aligning features from two images to establish precise correspondence, which is essential for accurate registration. While vision Transformers have propelled recent advancements in medical image analysis, training and inference with Transformers can become excessively computationally expensive, particularly due to the quadratic complexity of self-attention when handling long sequences of representations. This challenge becomes even more pronounced in 3D medical image registration tasks. To tackle this issue, we propose an efficient Hierarchical Pyramid Transformer for medical image registration. The proposed approach first capitalizes on the observation that early self-attention layers in Transformers primarily capture local patterns, so global attention yields limited benefit at these stages. Specifically, we employ a plain multi-layer perceptron (MLP) module, the spatial-shift MLP (S-MLP), in the early stages of feature extraction. This module uses a spatial-shift operation to facilitate communication between neighboring patches, encoding rich local patterns while substantially reducing computational cost. We further propose a sparse Transformer block that adaptively selects and preserves the most valuable self-attention values for feature extraction. We introduce a learnable top-k selection operator, allowing the model to retain only the attention scores that contribute most to each query token. This design significantly enhances feature extraction in the later stages. We conducted extensive evaluations on publicly available datasets, and the experimental results confirm that our proposed method achieves state-of-the-art performance in deformable medical image registration tasks.
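The two core operations described above, the spatial-shift token mixing of the S-MLP and the top-k sparse attention, can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the shift directions, channel-group split, and the fixed `top_k` argument (the paper's k is learnable) are all hypothetical choices for clarity.

```python
import numpy as np

def spatial_shift(x):
    """Sketch of the S-MLP spatial-shift operation: split channels into
    four groups and shift each group by one position along a different
    spatial direction, so each token mixes information from its four
    neighbors without any attention computation."""
    b, h, w, c = x.shape
    q = c // 4
    out = x.copy()  # boundary positions keep their original values
    out[:, 1:, :, :q]      = x[:, :-1, :, :q]       # shift down
    out[:, :-1, :, q:2*q]  = x[:, 1:, :, q:2*q]     # shift up
    out[:, :, 1:, 2*q:3*q] = x[:, :, :-1, 2*q:3*q]  # shift right
    out[:, :, :-1, 3*q:]   = x[:, :, 1:, 3*q:]      # shift left
    return out

def topk_sparse_attention(q, k, v, top_k):
    """Sketch of sparse attention: keep only the top_k largest scores
    per query and mask the rest to -inf before the softmax, so each
    query attends to its most relevant keys only."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                # (nq, nk)
    thresh = np.sort(scores, axis=-1)[:, -top_k][:, None]  # k-th largest
    masked = np.where(scores >= thresh, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `top_k` equal to the number of keys, the sparse block reduces to ordinary softmax attention; smaller values prune low-scoring key/value pairs per query, which is the source of the computational saving the abstract refers to.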