Abstract: Model generalisability, i.e. performance on multiple unseen datasets, can be improved by training on large volumes of annotated data, from which models can learn diverse representations. However, annotated medical data are limited because expert annotators are scarce. In this work, we present an efficient data sampling pipeline that selects DIVerse and bAlanced images (DataDIVA) from image pools to maximise model generalisability in retinal imaging. Specifically, we first extract image feature embeddings using off-the-shelf foundation models and cluster the embeddings. We then sample images evenly from these clusters and train a model on them. We run the trained model on the whole unlabelled image pool and draw the remaining samples from the images it classifies as rare categories. The pipeline aims to select retinal images with diverse representations and to mitigate the unbalanced category distribution. We show that DataDIVA consistently improved model performance in both internal and external evaluation across six public datasets, on the clinically meaningful tasks of referable diabetic retinopathy and glaucoma detection. The code is available at https://doi.org/10.5281/zenodo.12674694.
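The two sampling stages described above can be illustrated with a minimal sketch. It is not the released DataDIVA code: the function names, the round-robin strategy, and the definition of "rare" (least frequent predicted class first) are assumptions made for illustration, and the cluster assignments and model predictions are taken as given inputs rather than computed.

```python
import random
from collections import Counter, defaultdict


def sample_evenly_from_clusters(cluster_ids, budget, seed=0):
    """Stage 1 (hypothetical helper): pick image indices round-robin
    across embedding clusters until `budget` images are selected."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)
    for idxs in members.values():
        rng.shuffle(idxs)  # random order within each cluster

    queues = [members[c] for c in sorted(members)]
    chosen = []
    # Take one image per cluster per round so small clusters are not swamped.
    while len(chosen) < budget and any(queues):
        for queue in queues:
            if queue and len(chosen) < budget:
                chosen.append(queue.pop())
    return chosen


def sample_from_rare_classes(predicted, already_chosen, budget, seed=0):
    """Stage 2 (hypothetical helper): fill the remaining budget with images
    the stage-1 model predicts as the least frequent classes."""
    rng = random.Random(seed)
    taken = set(already_chosen)
    counts = Counter(predicted)
    # Assumption: "rare" means lowest predicted frequency in the pool.
    rarest_first = sorted(counts, key=lambda c: (counts[c], c))

    chosen = []
    for cls in rarest_first:
        pool = [i for i, p in enumerate(predicted) if p == cls and i not in taken]
        rng.shuffle(pool)
        chosen.extend(pool[: budget - len(chosen)])
        if len(chosen) >= budget:
            break
    return chosen
```

In practice the cluster assignments would come from clustering the foundation-model embeddings, and `predicted` from running the stage-1 model over the unlabelled pool; both are simply passed in here.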