Keywords: Vendor imbalance, vendor balancing, deep learning, mammography
TL;DR: This extended abstract reports results on using over-sampling to address vendor imbalance when training CNN models for mammography.
Abstract: Machine learning initiatives in the medical domain are often restricted by the data that is available. In mammography, imaging data of cancerous cases in particular is typically difficult and costly to acquire. As a result, data imbalance plays a larger role than in general image recognition projects, where large curated image databases are available. The class imbalance problem, which arises in many domains, has been studied extensively. Here, in contrast, we focus on an imbalance problem more specifically tied to the medical domain: vendor imbalance. Various general approaches exist for dealing with imbalanced data. We report on a case study of the effect of over-sampling as an approach to deal with vendor imbalance. We consider CNN training for soft-tissue lesion detection in mammography. A sequence of over-sampling configurations is compared, representing a gradual shift from no balancing, where data from each vendor is sampled in proportion to its abundance, to full balancing, where each vendor is sampled with equal probability. Contrary to our expectations, we find that for this learning problem the average performance across vendors is highest when no balancing is used.
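The gradual shift from no balancing to full balancing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the intermediate configurations are defined, so the linear interpolation below, the parameter name `alpha`, the helper names, and the example vendor counts are all assumptions made for illustration, not the authors' method. With `alpha = 0` each vendor is sampled in proportion to its abundance, and with `alpha = 1` each vendor is sampled with equal probability.

```python
import random
from collections import Counter


def vendor_sampling_probs(vendor_counts, alpha):
    """Per-vendor sampling probability for a balancing level alpha in [0, 1].

    alpha = 0: proportional to abundance (no balancing).
    alpha = 1: equal probability per vendor (full balancing).
    The linear interpolation in between is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total = sum(vendor_counts.values())
    n_vendors = len(vendor_counts)
    return {
        vendor: (1.0 - alpha) * count / total + alpha / n_vendors
        for vendor, count in vendor_counts.items()
    }


def sample_weights(vendors, alpha):
    """Per-sample weights: each vendor's probability is split evenly over its samples."""
    counts = Counter(vendors)
    probs = vendor_sampling_probs(counts, alpha)
    return [probs[v] / counts[v] for v in vendors]


def draw_epoch(items, vendors, alpha, k, seed=0):
    """Draw k training items with replacement under the given balancing level."""
    rng = random.Random(seed)
    return rng.choices(items, weights=sample_weights(vendors, alpha), k=k)


# Hypothetical data set: 80 images from vendor A, 15 from B, 5 from C.
example_vendors = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
example_items = list(range(len(example_vendors)))
example_epoch = draw_epoch(example_items, example_vendors, alpha=0.5, k=100)
```

Sampling with replacement means that, as `alpha` grows, images from the rarer vendors are repeated more often within an epoch, which is the over-sampling effect under study. In a deep learning framework, the per-sample weights would typically be handed to that framework's weighted sampler rather than to `random.choices`.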