Beyond Conventional Transformers: A Medical X-ray Attention Block for Improved Multi-Label Diagnosis
Submission Track: Short papers presenting ongoing research or work submitted to other venues (up to 5 pages, excluding references)
Keywords: Biomedical Imaging, Vision Transformers, Domain-Specific Attention, Multi-Label Classification, Chest X-ray Diagnosis
TL;DR: We introduce MXA, a proof of concept for a domain-specific attention mechanism that boosts multi-label chest X-ray diagnosis while remaining lightweight and clinically deployable.
Abstract: Transformers have reshaped visual recognition through generic self-attention, yet their application to specialized domains like medical imaging remains underexplored. In this work, we introduce the Medical X-ray Attention (MXA) block, an attention mechanism designed specifically for multi-label chest X-ray diagnosis. Unlike conventional attention modules, MXA augments transformer backbones with inductive priors tailored to radiology, including lightweight region-of-interest pooling and CBAM-style channel–spatial attention, both integrated in parallel with multi-head self-attention. To reduce the computational burden of traditional transformers and support deployment in clinical settings, we embed MXA within an Efficient Vision Transformer (EfficientViT) and apply knowledge distillation from a calibrated DenseNet-121 teacher. This combined approach produces a model that is both accurate and resource-efficient. Our framework achieves a mean AUC of 0.85 on the CheXpert benchmark, a +0.19 absolute improvement over a vanilla EfficientViT baseline and approximately a 233% relative improvement when gains are measured against chance-level performance (AUC = 0.5). These results demonstrate that attention modules can be "overfit" in a beneficial, task-aware sense to the unique structure and demands of clinical imaging. More broadly, we show that transformers do not need to remain generic, and that domain-specific attention can bridge the gap between expressive global modeling and real-world deployment.
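To make the architectural idea concrete, the sketch below shows one plausible way to combine multi-head self-attention in parallel with CBAM-style channel–spatial attention and a lightweight region-of-interest branch, as the abstract describes. It is an illustrative PyTorch approximation under our own assumptions (module names, dimensions, and the adaptive-pooling stand-in for ROI pooling are ours), not the authors' released implementation.

```python
# Illustrative sketch only: class names, dimensions, and the ROI stand-in are
# assumptions, not the paper's actual MXA implementation.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel and spatial attention over a feature map."""

    def __init__(self, dim, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel gate from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial gate from channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))


class MXABlockSketch(nn.Module):
    """Hypothetical MXA-style block: multi-head self-attention in parallel
    with CBAM-style attention and a lightweight pooled-region branch."""

    def __init__(self, dim, num_heads=4, roi_size=7):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cbam = ChannelSpatialAttention(dim)
        # Crude stand-in for ROI pooling: adaptive pooling to a fixed grid.
        self.roi_pool = nn.AdaptiveAvgPool2d(roi_size)
        self.fuse = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))          # (B, HW, C)
        global_branch, _ = self.mhsa(tokens, tokens, tokens)      # global context
        local_branch = self.cbam(x).flatten(2).transpose(1, 2)    # radiology-style prior
        roi_summary = self.roi_pool(x).flatten(2).mean(dim=2)     # (B, C) region summary
        fused = global_branch + local_branch + roi_summary.unsqueeze(1)
        out = self.fuse(fused).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                            # residual connection


# Minimal usage example on a dummy feature map.
if __name__ == "__main__":
    block = MXABlockSketch(dim=64)
    feats = torch.randn(2, 64, 14, 14)
    print(block(feats).shape)  # torch.Size([2, 64, 14, 14])
```

The parallel (rather than sequential) arrangement of the branches reflects the abstract's description of ROI pooling and channel–spatial attention being integrated alongside, not stacked on top of, multi-head self-attention; the exact fusion and the distillation pipeline from the DenseNet-121 teacher are described in the full paper and are not reproduced here.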
Submission Number: 69