Valid oversampling schemes to handle imbalance

Young-geun Kim, Yongchan Kwon, Myunghee Cho Paik

Published: 01 Jul 2019, Last Modified: 05 Sept 2025, Pattern Recognition Letters, Everyone, CC BY 4.0

Abstract: Class imbalance is a common problem in machine learning. When data are not balanced, the correct classification rate for the minority class suffers even if overall accuracy is high. Oversampling has been used to address the issue without regard to the accuracy it sacrifices. In addition, an arbitrary oversampling scheme may introduce bias. In this paper, we propose principled methods of handling imbalance under user-specified constraints on sensitivity and specificity. Our work makes three contributions. First, we provide an optimized target proportion that minimizes the maximum error rate under user-specified constraints on sensitivity and specificity. Second, we introduce the notion of resampling at random (RAR), under which the limit of the estimator is not altered from that of the original sample. These two elements are relevant to any classification method. Third, we derive asymptotic properties of selected classifiers when RAR oversampling is applied with the target proportion. Finally, we implement the proposed method in an image recognition context using features extracted from the last layer of deep convolutional neural networks (CNNs). We present an analysis of fundus data to classify diabetic retinopathy using the proposed method.
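
As a rough illustration of the quantities the abstract refers to, the sketch below oversamples the minority class uniformly at random, with replacement, until it reaches a given target proportion, and computes sensitivity and specificity from predictions. It is a minimal sketch and not the authors' procedure: the function names `oversample_to_proportion` and `sensitivity_specificity` are hypothetical, the target proportion is simply passed in rather than optimized under constraints, and the sketch does not establish whether uniform resampling satisfies the paper's RAR condition.

```python
import numpy as np


def oversample_to_proportion(X, y, target_prop, minority_label=1, seed=0):
    """Duplicate minority rows at random until they make up about target_prop.

    Assumption: rows are drawn uniformly with replacement from the minority
    class; the paper's optimized proportion is NOT computed here.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    if not 0.0 < target_prop < 1.0:
        raise ValueError("target_prop must lie strictly between 0 and 1")
    minority_idx = np.flatnonzero(y == minority_label)
    if minority_idx.size == 0:
        raise ValueError("no minority samples to resample from")
    n_major = y.size - minority_idx.size
    # Minority count m such that m / (m + n_major) is at least target_prop.
    n_needed = int(np.ceil(target_prop * n_major / (1.0 - target_prop)))
    n_extra = max(n_needed - minority_idx.size, 0)
    rng = np.random.default_rng(seed)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    # Keep every original row and append the resampled minority rows.
    keep = np.concatenate([np.arange(y.size), extra_idx])
    return X[keep], y[keep]


def sensitivity_specificity(y_true, y_pred, positive_label=1):
    """Return (sensitivity, specificity) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive_label
    neg = ~pos
    sensitivity = float(np.mean(y_pred[pos] == positive_label))
    specificity = float(np.mean(y_pred[neg] != positive_label))
    return sensitivity, specificity


# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = oversample_to_proportion(X, y, target_prop=0.5)
print("minority proportion after oversampling:", y_bal.mean())
```

Any classifier could then be fit on `X_bal, y_bal` and evaluated with `sensitivity_specificity` on held-out data that has not been resampled, so the reported rates reflect the original class distribution.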