WaveFormer: Leveraging Wavelet Transformation for Multi-Scale Token Interactions in Hierarchical Transformers
Keywords: Transformer, Attention Mechanism, Receptive Field, Discrete Wavelet Transformation, Parseval's Theorem
Abstract: Recent transformer models have achieved state-of-the-art performance on visual tasks involving high-dimensional data, such as 3D volumetric medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) circumvent the computational cost of self-attention through a shifted-window approach that learns token relations within progressively overlapping local regions, expanding the receptive field across layers while restricting each layer's attention span to predefined windows. In this work, we introduce a novel learning paradigm that captures token relations through progressive summarization of features. We leverage the compaction capability of the discrete wavelet transform (DWT) on high-dimensional features and learn token relations in the multi-scale approximation coefficients obtained from the DWT. This approach efficiently represents fine-grained local to coarse global contexts within each network layer. Furthermore, computing self-attention on the DWT-transformed features significantly reduces computational complexity, effectively addressing the challenges posed by high-dimensional data in vision transformers. Our proposed network, termed WaveFormer, competes favorably with current state-of-the-art transformers (e.g., SwinUNETR) on three challenging public volumetric medical imaging datasets: (1) MICCAI Challenge 2021 FLARE, (2) MICCAI Challenge 2019 KiTS, and (3) MICCAI Challenge 2022 AMOS. WaveFormer consistently outperforms SwinUNETR, improving the Dice score from 0.929 to 0.938 on FLARE2021 and from 0.880 to 0.900 on AMOS2022. In addition, we examine WaveFormer's effectiveness in segmenting organs of varying sizes, demonstrating its robustness across different anatomical structures. The source code will be made available with supplementary materials in the complete paper submission.
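For intuition, the following is a minimal sketch (not the authors' implementation) of the core idea described in the abstract: self-attention is computed on the DWT approximation coefficients of a 3D feature volume rather than on full-resolution tokens, so a single decomposition level reduces the token count by 8x. It assumes PyTorch and PyWavelets; the function name dwt_attention and its parameters are hypothetical, chosen only for illustration.

# Minimal sketch (not the authors' code): self-attention over the approximation
# band of a level-1 3D Haar DWT, assuming PyTorch and PyWavelets are available.
import numpy as np
import pywt
import torch
import torch.nn as nn

def dwt_attention(feat, num_heads=4):
    """feat: (C, D, H, W) numpy feature volume; returns attended coarse tokens."""
    C = feat.shape[0]
    # Per-channel 3D Haar DWT; keep only the approximation band 'aaa',
    # which halves D, H, and W (i.e., 8x fewer tokens than the input grid).
    approx = np.stack([pywt.dwtn(feat[c], "haar")["aaa"] for c in range(C)])
    tokens = torch.from_numpy(approx).float().reshape(C, -1).T  # (N_tokens, C)
    attn = nn.MultiheadAttention(embed_dim=C, num_heads=num_heads, batch_first=True)
    out, _ = attn(tokens[None], tokens[None], tokens[None])  # self-attention
    return out  # (1, N_tokens, C)

# Example: a 32-channel 64^3 feature map yields 32^3 = 32,768 tokens
# instead of 64^3 = 262,144 tokens.
out = dwt_attention(np.random.rand(32, 64, 64, 64).astype(np.float32))

Because self-attention scales quadratically with the number of tokens, halving each spatial dimension in this way reduces the attention cost by roughly a factor of 64 relative to attending over the full-resolution grid.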
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12418