Abstract: Despite the significant recent success of Vision Transformers in computer vision, they struggle with dense prediction tasks due to their limited capability in extracting local features and handling multiple feature scales. Most existing Vision Transformer variants that address these challenges introduce additional computational overhead. We therefore present a Dynamic Kernel with Gaussian Fusion Transformer, named DGFormer, tailored for dense prediction tasks, which captures complex spatial relationships and local patch-level feature interactions while reducing computational complexity. The DGFormer framework combines an encoder-decoder structure with two novel attention modules: 1) a Dynamic Kernel Attention mechanism, which dynamically adapts representations to individual samples and enhances local feature expressiveness, and 2) a Gaussian Kernel Fusion strategy, which efficiently improves the precision and adaptability of feature fusion. We evaluate DGFormer across multiple dense prediction datasets, frameworks, and metrics, demonstrating substantial improvements in both model performance and computational efficiency. We hope DGFormer offers new insights for dense prediction backbone design and facilitates future research.
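The abstract does not specify how the two modules are implemented; the sketch below is only one plausible reading, not the paper's method. It illustrates the general ideas the abstract names: per-sample dynamic kernels (in the spirit of dynamic convolution) fused with a fixed Gaussian-kernel response. All names (`DynamicKernelBlock`, `kernel_gen`, `gate`, the kernel size and sigma) are hypothetical placeholders.

```python
# Hypothetical sketch of per-sample dynamic kernels fused with a Gaussian prior.
# This is NOT the DGFormer implementation; it only illustrates the two ideas
# named in the abstract under assumed design choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicKernelBlock(nn.Module):
    """Generates a depthwise kernel per sample from a pooled descriptor and
    blends its response with a fixed Gaussian-smoothed response."""

    def __init__(self, channels: int, kernel_size: int = 3, sigma: float = 1.0):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Small MLP mapping a global descriptor to per-channel dynamic kernels.
        self.kernel_gen = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels * kernel_size * kernel_size),
        )
        # Fixed (non-learned) 2D Gaussian kernel used as a local smoothing prior.
        coords = torch.arange(kernel_size) - (kernel_size - 1) / 2
        g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        g2d = torch.outer(g, g)
        g2d = g2d / g2d.sum()
        self.register_buffer(
            "gauss", g2d.expand(channels, 1, kernel_size, kernel_size).clone()
        )
        # Learned scalar gate controlling the fusion of the two responses.
        self.gate = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.kernel_size
        # Per-sample dynamic depthwise kernels from a global average descriptor.
        descriptor = x.mean(dim=(2, 3))                         # (B, C)
        dyn = self.kernel_gen(descriptor).view(b * c, 1, k, k)
        # Fold the batch into channels so each sample uses its own kernels
        # within a single grouped convolution call.
        x_flat = x.view(1, b * c, h, w)
        dyn_out = F.conv2d(x_flat, dyn, padding=k // 2, groups=b * c).view(b, c, h, w)
        # Gaussian-smoothed response, shared across samples.
        gauss_out = F.conv2d(x, self.gauss, padding=k // 2, groups=c)
        # Convex fusion of the dynamic and Gaussian responses.
        alpha = torch.sigmoid(self.gate)
        return alpha * dyn_out + (1 - alpha) * gauss_out


if __name__ == "__main__":
    block = DynamicKernelBlock(channels=64)
    feats = torch.randn(2, 64, 32, 32)
    print(block(feats).shape)  # torch.Size([2, 64, 32, 32])
```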