Abstract: High-performance deep learning methods typically rely on
large annotated training datasets, which are difficult to obtain in many
clinical applications due to the high cost of medical image labeling. Existing
data assessment methods commonly require knowing the labels in
advance, which is incompatible with our goal of ‘knowing which
data to label.’ To this end, we propose a novel and efficient
data assessment strategy, the EXponentiAl Marginal sINgular valuE
(EXAMINE) score, to rank the quality of unlabeled medical image data
based on their useful latent representations extracted via Self-supervised
Learning (SSL) networks. Motivated by the theoretical implications of the SSL
embedding space, we leverage a Masked Autoencoder [8] for feature extraction.
Furthermore, we evaluate the quality of each data point by the marginal
change in the largest singular value after excluding that point from the
dataset. We conduct extensive experiments on a pathology dataset. Our
results indicate that the proposed method is both effective and efficient
at selecting the most valuable data to label.
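
The leave-one-out idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the SSL embeddings are stacked as rows of a matrix (the abstract does not say which matrix the singular value is taken from), the function name is hypothetical, and the exponential transformation suggested by the name EXAMINE is omitted because the abstract does not define it.

```python
import numpy as np


def marginal_singular_value_change(features):
    """Leave-one-out change in the largest singular value.

    features: (n, d) array, one embedding per row (assumed layout).
    Returns an (n,) array whose i-th entry is
    sigma_max(features) - sigma_max(features without row i).
    """
    X = np.asarray(features, dtype=float)
    # Largest singular value of the full matrix (values are sorted descending).
    sigma_full = np.linalg.svd(X, compute_uv=False)[0]

    changes = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        # Drop the i-th data point and recompute the largest singular value.
        reduced = np.delete(X, i, axis=0)
        sigma_reduced = np.linalg.svd(reduced, compute_uv=False)[0]
        changes[i] = sigma_full - sigma_reduced
    return changes


# Toy stand-in for extracted embeddings (hypothetical values).
toy_features = np.array([[3.0, 0.0],
                         [0.0, 1.0],
                         [0.0, 0.0]])
toy_changes = marginal_singular_value_change(toy_features)
# Indices ordered from largest to smallest marginal change.
toy_ranking = np.argsort(-toy_changes)
```

Recomputing a full SVD for every excluded point costs one decomposition per sample. It is used here only for clarity and says nothing about how the paper achieves its efficiency.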