MTVNet: Multi-Contextual Transformers for Volumes – Network for Super-Resolution with Long-Range Interactions
Keywords: Super-Resolution, 3D image processing, low-level vision
TL;DR: We introduce MTVNet, a volumetric transformer that expands the receptive field through multi-scale contextual modeling to overcome memory constraints in 3D super-resolution.
Abstract: Recent advances in transformer-based models have led to significant improvements in 2D image super-resolution. However, leveraging these advances for volumetric super-resolution remains challenging due to the high memory demands of self-attention mechanisms in 3D volumes, which severely limit the receptive field. As a result, long-range interactions, one of the key strengths of transformers, are underutilized in 3D super-resolution. To address this, we propose MTVNet, a volumetric transformer model that leverages information from expanded contextual regions at multiple resolution scales. Here, coarse-resolution information from broader context regions is carried forward to inform the super-resolution prediction of a smaller area. Using transformer layers at each resolution, our coarse-to-fine modeling limits the number of tokens at each scale and enables attention over larger regions than previously possible. We compare MTVNet against state-of-the-art models on five 3D datasets. Our results show that expanding the receptive field of transformer-based methods yields significant performance gains on high-resolution 3D data. While CNNs outperform transformers on low-resolution data, transformer-based methods excel on high-resolution volumes with exploitable long-range dependencies, with our MTVNet achieving state-of-the-art performance. Our code is available at link.
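To make the coarse-to-fine scheme described above concrete, the following is a minimal PyTorch sketch of the idea: each level tokenizes a progressively smaller crop under a fixed token budget and passes pooled features down to the next, finer level. All names (ContextLevel, CoarseToFineSR), the patching scheme, and the feature-fusion mechanism are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextLevel(nn.Module):
    """One resolution level: patch-embed a 3D crop, fuse the pooled features
    carried down from the coarser level above, and run self-attention."""

    def __init__(self, in_ch, embed_dim, patch_size, n_heads=4):
        super().__init__()
        self.embed = nn.Conv3d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vol, coarse_feat=None):
        tok = self.embed(vol)                  # (B, E, d, h, w)
        tok = tok.flatten(2).transpose(1, 2)   # (B, d*h*w, E)
        if coarse_feat is not None:
            # Broadcast the pooled coarse-context features to every token,
            # so long-range information informs the finer prediction.
            ctx = coarse_feat.mean(dim=1, keepdim=True).expand_as(tok)
            tok = self.fuse(torch.cat([tok, ctx], dim=-1))
        return self.encoder(tok)


class CoarseToFineSR(nn.Module):
    """Levels run from a broad, coarsely sampled context region down to the
    small target crop; every level sees roughly the same number of tokens."""

    def __init__(self, in_ch=1, embed_dim=64, patch_size=4, n_levels=3, scale=2):
        super().__init__()
        self.levels = nn.ModuleList(
            [ContextLevel(in_ch, embed_dim, patch_size) for _ in range(n_levels)]
        )
        self.recon = nn.Conv3d(embed_dim, in_ch, kernel_size=3, padding=1)
        self.n_levels = n_levels
        self.scale = scale

    @staticmethod
    def center_crop(vol, size):
        d, h, w = vol.shape[-3:]
        sd, sh, sw = (d - size) // 2, (h - size) // 2, (w - size) // 2
        return vol[..., sd:sd + size, sh:sh + size, sw:sw + size]

    def forward(self, vol):
        # vol: large low-resolution context volume, e.g. (B, 1, 64, 64, 64).
        token_extent = vol.shape[-1] // 2 ** (self.n_levels - 1)  # innermost crop size
        feat, crop = None, vol
        for i, level in enumerate(self.levels):
            crop = self.center_crop(vol, vol.shape[-1] // 2 ** i)
            # Resample coarser levels so every level has the same token budget.
            crop_in = F.interpolate(crop, size=(token_extent,) * 3,
                                    mode="trilinear", align_corners=False)
            feat = level(crop_in, feat)
        # Fold the final tokens back into a feature volume, upsample to the
        # super-resolved size of the innermost crop, and reconstruct.
        B, N, E = feat.shape
        s = round(N ** (1 / 3))
        feat = feat.transpose(1, 2).reshape(B, E, s, s, s)
        out_size = crop.shape[-1] * self.scale
        feat = F.interpolate(feat, size=(out_size,) * 3,
                             mode="trilinear", align_corners=False)
        return self.recon(feat)  # (B, in_ch, out_size, out_size, out_size)


model = CoarseToFineSR()
lr_context = torch.randn(1, 1, 64, 64, 64)   # low-res volume with broad context
sr_crop = model(lr_context)
print(sr_crop.shape)                          # torch.Size([1, 1, 32, 32, 32])

In this sketch the outermost level attends over the full 64-voxel context at reduced resolution, while the innermost level attends over the 16-voxel target crop at full resolution, which is the sense in which the token count per scale stays bounded while the effective receptive field grows.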
Serve As Reviewer: ~August_Leander_Høeg1
Submission Number: 37