Keywords: Vision Transformer · Inductive Bias · Multi-scale features.
Abstract: Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, even though hierarchical representations are considered advantageous for medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational power of Vision Transformers (ViTs). Addressing ViTs' lack of inductive biases and their dependence on extensive training datasets, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process that preserves the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations; it complements patch-wise attention by enhancing spatial understanding while preserving global perception, which we refer to as local and global attention, respectively. Our model significantly outperforms baseline models in classification accuracy, demonstrating its effectiveness in bridging the gap between CNNs and ViTs. The components are designed as plug-and-play modules for different CNN architectures and can be adapted to multiple applications. The code is available at \href{https://github.com/xiaoyatang/DuoFormer.git}{https://github.com/xiaoyatang/DuoFormer.git}.
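To make the described pipeline concrete, below is a minimal PyTorch sketch of the flow the abstract outlines: a CNN backbone producing hierarchical features, a patch tokenization step that maps each scale into a shared token space, and a block that applies scale-wise attention followed by patch-wise attention. It assumes a ResNet-50 backbone and standard torchvision/PyTorch utilities; the layer choices, embedding size, and attention ordering are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the abstract's pipeline, assuming a ResNet-50 backbone.
# All hyperparameters (dim, grid, heads) are illustrative, not the paper's.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor


class MultiScaleTokenizer(nn.Module):
    """Projects hierarchical CNN feature maps into a shared token space."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), dim=256, grid=7):
        super().__init__()
        # 1x1 convs unify channel width; adaptive pooling unifies resolution
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)
        self.pool = nn.AdaptiveAvgPool2d(grid)

    def forward(self, feats):
        # feats: list of (B, C_s, H_s, W_s) maps from shallow to deep layers
        tokens = []
        for f, proj in zip(feats, self.proj):
            t = self.pool(proj(f))                       # (B, dim, g, g)
            tokens.append(t.flatten(2).transpose(1, 2))  # (B, g*g, dim)
        return torch.stack(tokens, dim=1)                # (B, S, N, dim)


class DuoAttentionBlock(nn.Module):
    """Scale-wise (inter-/intra-scale) attention, then patch-wise attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.scale_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.patch_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        B, S, N, D = x.shape
        # Scale-wise: attend across the S scales at each spatial location
        xs = self.norm1(x.permute(0, 2, 1, 3).reshape(B * N, S, D))
        xs = xs + self.scale_attn(xs, xs, xs)[0]
        x = xs.reshape(B, N, S, D).permute(0, 2, 1, 3)
        # Patch-wise: attend across the N patches within each scale
        xp = self.norm2(x.reshape(B * S, N, D))
        xp = xp + self.patch_attn(xp, xp, xp)[0]
        return xp.reshape(B, S, N, D)


# Usage: extract hierarchical features, tokenize, run one attention block.
backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
)
tokenizer = MultiScaleTokenizer()
block = DuoAttentionBlock()

img = torch.randn(2, 3, 224, 224)
feats = list(backbone(img).values())  # four maps: 256/512/1024/2048 channels
out = block(tokenizer(feats))         # (2, 4, 49, 256)
```

The scale-wise step plays the "local attention" role described above (relating representations of the same region across scales), while the patch-wise step provides the "global attention" over spatial positions; in practice the two could also be interleaved or fused differently.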
Primary Subject Area: Unsupervised Learning and Representation Learning
Secondary Subject Area: Detection and Diagnosis
Paper Type: Methodological Development
Registration Requirement: Yes
Reproducibility: https://github.com/xiaoyatang/DuoFormer.git
Submission Number: 160