BinaryFormer: 1-bit self-attention for long-range transformers in medical image segmentation and 3D diffusion models
Keywords: self-attention, differentiable binarisation, diffusion, energy efficiency
TL;DR: We propose and evaluate a differentiable binarisation scheme for 1-bit training and inference of self-attention over very long sequences.
Abstract: Vision transformers excel at capturing long-range interactions and have become essential for many medical image analysis tasks. Their computational cost, however, grows quadratically with sequence length, which is problematic for certain 3D problems, e.g. high-resolution diffusion models that require dozens of sampling steps. Flash attention addresses some of these limitations by optimising local memory access, but leaves the computational burden high. Quantising weights and activations for convolutions, and even fully binary networks, is possible, but such models have to be trained at higher precision and often suffer performance drops. For transformers, recent studies have been limited to quantising the weights of linear layers or exploiting the potential of sparsity in self-attention scores.
We present a novel scheme that not only enables binary-precision computation of self-attention at inference time but also extends it to the training of transformers. To achieve differentiability we combine the bitwise Hamming distance with a learnable scalar weighting of queries and keys. In theory this yields a 16-32x improvement in resource efficiency for arithmetic operations and memory bandwidth. We evaluate our model on three tasks with sequence lengths of N>1000: classification of images without patch embedding, semantic 2D MRI segmentation, and 3D high-resolution diffusion models for inpainting and synthesis. Our results demonstrate competitive performance, and we provide an intuitive explanation of the effectiveness of differentiable key and query weighting through Bernoulli sampling and distance interpolation. https://github.com/mattiaspaul/BinaryFormer
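To make the core idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation, which is available at the repository above): queries and keys are binarised to {-1, +1} with a straight-through sign estimator and modulated by learnable scalar weights (here named alpha_q and alpha_k as placeholders), and the attention score is a dot product of binary vectors, which is an affine function of the Hamming distance and could be computed with XOR and popcount on packed bits at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarySign(torch.autograd.Function):
    """Sign binarisation with a clipped straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where the input lies in [-1, 1].
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)


class HammingAttention(nn.Module):
    """Single-head attention whose query/key similarity is an affine function
    of the Hamming distance between binarised queries and keys (sketch only)."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Hypothetical learnable scalar weightings for queries and keys.
        self.alpha_q = nn.Parameter(torch.ones(1))
        self.alpha_k = nn.Parameter(torch.ones(1))
        self.dim = dim

    def forward(self, x):
        # x: (batch, sequence, dim)
        q = BinarySign.apply(self.alpha_q * self.q_proj(x))  # entries in {-1, +1}
        k = BinarySign.apply(self.alpha_k * self.k_proj(x))
        v = self.v_proj(x)
        # For +/-1 vectors: q.k = dim - 2 * Hamming(q, k), so this dot product
        # could be replaced by XOR + popcount on bit-packed tensors at inference.
        scores = (q @ k.transpose(-2, -1)) / self.dim ** 0.5
        return F.softmax(scores, dim=-1) @ v
```

The scalar weights give the binarisation step a differentiable handle during training; how the paper combines this with Bernoulli sampling and distance interpolation is described in the full text.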
Primary Subject Area: Generative Models
Secondary Subject Area: Segmentation
Paper Type: Methodological Development
Registration Requirement: Yes
Reproducibility: https://github.com/mattiaspaul/BinaryFormer
MIDL LaTeX Submission Checklist:
- Ensure no LaTeX errors during compilation.
- Created a single midl25_NNN.zip file with midl25_NNN.tex, midl25_NNN.bib, all necessary figures and files.
- Includes \documentclass{midl}, \jmlryear{2025}, \jmlrworkshop, \jmlrvolume, \editors, and correct \bibliography command.
- Did not override options of the hyperref package.
- Did not use the times package.
- All authors and co-authors are correctly listed with proper spelling and avoid Unicode characters.
- Author and institution details are de-anonymized where needed. All author names, affiliations, and paper title are correctly spelled and capitalized in the biography section.
- References must use the .bib file. Did not override the bibliographystyle defined in midl.cls. Did not use \begin{thebibliography} directly to insert references.
- Tables and figures do not overflow margins; avoid using \scalebox; used \resizebox when needed.
- Included all necessary figures and removed *unused* files in the zip archive.
- Removed special formatting, visual annotations, and highlights used during rebuttal.
- All special characters in the paper and .bib file use LaTeX commands (e.g., \'e for é).
- Appendices and supplementary material are included in the same PDF after references.
- Main paper does not exceed 9 pages; acknowledgements, references, and appendix start on page 10 or later.
Latex Code: zip
Copyright Form: pdf
Submission Number: 120