Lost in Transformation: Current roadblocks for Transformers in 3D medical image segmentation

Saikat Roy; Tassilo Wald; Michael Baumgartner; Constantin Ulrich; Gregor Koehler; David Zimmerer; Fabian Isensee; Klaus Maier-Hein

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Supplementary Material: pdf

Primary Area: representation learning for computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: representation learning, transformers, medical image segmentation, semantic segmentation, sparse datasets

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: In medical image segmentation, sparsely annotated, limited datasets are common, posing a natural hurdle for Transformer-based segmentation networks. In this work, we systematically dissect 9 popular Transformer-based networks on two representative organ and pathology segmentation datasets and explore whether Transformers are still beneficial under these challenging conditions. 1) We demonstrate that these Transformer-based segmentation networks frequently incorporate substantial convolutional backbones, which contribute most of their performance, while the Transformers themselves play a peripheral role. 2) Extending beyond accuracy, we analyze error and representational similarity to uncover architectures with underutilized Transformers, as shown by an indiscernible change in both metrics when the Transformer is removed. 3) We quantify the massive dataset size 'chasm' between medical and natural images and examine the impact of data reduction on performance, showing that Transformers close the performance gap to CNNs as the dataset size increases. 4) Additionally, we probe the importance of long-range interactions, showing that even limited receptive fields offer high performance in segmenting medical images, which calls into question the need for the long-range interactions inherent to Transformers. In doing so, we identify significant challenges faced by major architectures employing Transformers for medical image segmentation, which may contribute to inefficiencies downstream in the domain.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 5390
