PG3D-ViT: A Prompt-Guided 3D Vision Transformer for Medical Image Classification

Juan Gong

Published: 29 Jan 2026, Last Modified: 28 Jan 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: 3D medical image classification is challenging because lesions are small and subtle and are surrounded by substantial irrelevant context, which often misleads deep models. Inspired by the top-down diagnostic process of clinicians, who first identify the anatomical context and then locate anomalies, we propose the Prompt-Guided 3D Vision Transformer (PG3D-ViT), a framework that simulates clinical reasoning through prompt-driven attention. To address limited 3D training data, PG3D-ViT leverages 2D masked autoencoder (MAE) pretraining to learn transferable image features. A prompt generation module performs consistency-difference analysis between normal and abnormal samples to extract prompts that encode anatomical structure and global spatial information related to the lesion context. These prompts are injected as queries into a cross-attention mechanism, guiding the model to focus on lesion-relevant regions across the 3D volume. Evaluated on 7 public datasets spanning multiple modalities and pathologies, PG3D-ViT achieves a 1.88% average AUC improvement over state-of-the-art methods. Attention map visualizations show that the model accurately localizes lesion regions, supporting the effectiveness of the clinical prompting mechanism in improving both the performance and the interpretability of 3D medical image classification.
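To make the idea of injecting prompts as queries into cross-attention concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `prompt_cross_attention`, the single-head form, the absence of learned projections, and the toy sizes (27 tokens standing in for a 3x3x3 grid of patches, 2 prompts) are all assumptions made for illustration, and the random vectors stand in for real prompt and volume features.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prompt_cross_attention(prompts, tokens):
    """Single-head cross-attention with prompts as queries (hypothetical sketch).

    prompts: list of P query vectors, each of length d
    tokens:  list of N volume-token vectors, each of length d, used here as
             both keys and values (no learned projections in this sketch)
    Returns (outputs, weights): P attended vectors and a P x N weight matrix.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in prompts:
        # Scaled dot-product score between this prompt and every volume token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        w = softmax(scores)
        # Weighted sum of the token vectors under the attention weights.
        out = [sum(w[n] * tokens[n][j] for n in range(len(tokens)))
               for j in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy data (assumed sizes): 27 tokens for a 3x3x3 patch grid, 2 prompts.
random.seed(0)
d, num_tokens, num_prompts = 8, 27, 2
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_tokens)]
prompts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_prompts)]
outputs, weights = prompt_cross_attention(prompts, tokens)
```

Each row of `weights` is a distribution over the volume tokens for one prompt, so in this sketch it could be reshaped back to the patch grid and viewed as an attention map of the kind the abstract describes.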