Pre-training on high-resolution X-ray images: an experimental study

Published: 01 Jan 2025, Last Modified: 21 Oct 2025. Vis. Intell. 2025. License: CC BY-SA 4.0.
Abstract: Existing X-ray image based pre-trained vision models are typically trained on relatively small-scale datasets (fewer than 500,000 samples) at limited resolution (e.g., \(224 \times 224\)). However, the key to the success of self-supervised pre-training of large models lies in massive training data, and preserving high-resolution X-ray images contributes to effective solutions for some challenging diseases. In this paper, we propose a high-resolution (\(1280 \times 1280\)) X-ray image based pre-trained baseline model, trained on our newly collected large-scale dataset containing more than 1 million X-ray images. Our model adopts the masked auto-encoder framework: the image tokens remaining after masking at a high ratio are taken as input, and the masked image patches are reconstructed by a Transformer encoder-decoder network. More importantly, we introduce a novel context-aware masking strategy that uses the breast contour as a boundary for adaptive masking operations. We validate the effectiveness of our model on two downstream tasks, namely X-ray report generation and disease detection. Extensive experiments demonstrate that our pre-trained medical baseline model achieves performance comparable to, or even exceeding, that of current state-of-the-art models on downstream benchmark datasets.
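To make the masking procedure concrete, the sketch below illustrates one plausible form of contour-guided adaptive masking in an MAE-style pipeline. It is a minimal illustration, not the authors' implementation: the function name `context_aware_mask`, the 16-pixel patch size, the 0.75 masking ratio, and the `inside_weight` parameter are all assumptions for demonstration.

```python
# Minimal sketch of MAE-style masking biased by an anatomy contour.
# Assumption: a binary contour mask (1 inside the anatomy, 0 outside) is available.
import torch

def context_aware_mask(anatomy_mask, patch_size=16, mask_ratio=0.75, inside_weight=2.0):
    """Sample patch indices to mask, favoring patches inside the contour.

    anatomy_mask: (H, W) binary tensor.
    Returns a boolean tensor of shape (num_patches,), True = masked.
    """
    # Fraction of each non-overlapping patch that lies inside the contour.
    patches = anatomy_mask.float().unfold(0, patch_size, patch_size) \
                                   .unfold(1, patch_size, patch_size)
    inside_frac = patches.mean(dim=(-1, -2)).flatten()   # (num_patches,)
    num_patches = inside_frac.numel()
    num_masked = int(mask_ratio * num_patches)

    # Patches inside the contour get a higher sampling weight, so informative
    # anatomy is masked (and must be reconstructed) more often than background.
    weights = 1.0 + (inside_weight - 1.0) * inside_frac
    idx = torch.multinomial(weights, num_masked, replacement=False)

    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[idx] = True
    return mask

# A 1280x1280 image with 16x16 patches yields 80*80 = 6400 tokens; at a 0.75
# ratio, 4800 patches are masked and only 1600 visible tokens reach the encoder.
if __name__ == "__main__":
    dummy_contour = torch.zeros(1280, 1280)
    dummy_contour[200:1100, 300:1000] = 1  # placeholder anatomy region
    m = context_aware_mask(dummy_contour)
    print(m.sum().item(), "of", m.numel(), "patches masked")
```

Under these assumptions, the visible (unmasked) tokens are fed to the Transformer encoder and the decoder reconstructs the masked patches, as in standard masked auto-encoding; only the sampling of which patches to mask is adapted to the contour.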