TL;DR: We propose a method for maximizing the AUC under covariate shift by using positive and unlabeled data in the training distribution and unlabeled data in the test distribution.
Abstract: Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced binary classification tasks. Existing AUC maximization methods typically assume that the training and test distributions are identical. However, this assumption is often violated by a {\it covariate shift}, where the input distribution changes between training and testing while the conditional distribution of the class label given the input remains unchanged. Importance weighting is a common approach to covariate shift: it minimizes the test risk using importance-weighted training data. However, it cannot maximize the AUC. In this paper, we theoretically derive two estimators of the test AUC risk under covariate shift by using positive and unlabeled (PU) data from the training distribution and unlabeled data from the test distribution. Our first estimator is computed from importance-weighted PU data in the training distribution, and the second from importance-weighted positive data in the training distribution and unlabeled data in the test distribution. We train classifiers by minimizing a weighted sum of the two AUC risk estimators, which approximates the test AUC risk. Unlike existing importance-weighting methods, ours requires neither negative labels nor class priors. We demonstrate the effectiveness of our method on six real-world datasets.
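To make the construction concrete, the following is a minimal sketch of the general shape of such an objective: an importance-weighted pairwise AUC surrogate risk combining (1) PU pairs from the training distribution and (2) training positives paired with test unlabeled data. This is not the paper's exact derivation; the squared surrogate loss, the assumption that a weight function w(x) = p_test(x)/p_train(x) is given, and the mixing coefficient `alpha` are all illustrative choices, and the names `pairwise_auc_risk` and `combined_risk` are hypothetical.

```python
# Illustrative sketch only: shows the general shape of an importance-weighted
# pairwise AUC objective, not the paper's exact risk estimators.
import numpy as np

def pairwise_auc_risk(scores_pos, scores_other, w_pos, w_other):
    """Importance-weighted pairwise AUC surrogate risk.

    Each (positive, other) pair contributes a squared surrogate loss on the
    score margin f(x_p) - f(x_o), weighted by w(x_p) * w(x_o) to account for
    the shift from the training to the test input distribution.
    """
    margins = scores_pos[:, None] - scores_other[None, :]  # all pairs
    losses = (1.0 - margins) ** 2                          # squared surrogate (assumption)
    weights = w_pos[:, None] * w_other[None, :]
    return np.sum(weights * losses) / np.sum(weights)

def combined_risk(f, X_pos_tr, X_unl_tr, X_unl_te, w, alpha=0.5):
    """Weighted sum of two AUC risk terms (shape only):
      (1) importance-weighted PU pairs from the training distribution,
      (2) importance-weighted training positives vs. test unlabeled data
          (test samples already follow the test distribution, so they get
          unit weights).
    """
    s_pos_tr, s_unl_tr, s_unl_te = f(X_pos_tr), f(X_unl_tr), f(X_unl_te)
    w_pos, w_unl = w(X_pos_tr), w(X_unl_tr)
    r1 = pairwise_auc_risk(s_pos_tr, s_unl_tr, w_pos, w_unl)
    r2 = pairwise_auc_risk(s_pos_tr, s_unl_te, w_pos, np.ones(len(s_unl_te)))
    return alpha * r1 + (1.0 - alpha) * r2
```

In practice one would plug in a parametric scorer for `f` and minimize `combined_risk` with gradient-based optimization; the paper's contribution is deriving estimators of this kind that remain valid without negative labels or class priors.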
Lay Summary: Many practical applications, such as malware detection, medical diagnosis, and fault detection, are formulated as binary classification problems with class-imbalanced data. AUC maximization is a standard approach for learning accurate classifiers from such imbalanced data. However, in real-world scenarios, data tendencies (i.e., data distributions) often differ between the training and test phases, degrading the performance of classifiers trained with standard methods. In this study, we focus on the situation where the input distribution changes, known as covariate shift, and show that the test AUC can be maximized using only positive and unlabeled data in the training phase and unlabeled data in the test phase. This means that appropriate classifiers for imbalanced data can be learned while reducing the cost of expensive annotation. Our findings are expected to contribute to the development of more practical detection and diagnostic systems.
Primary Area: General Machine Learning->Transfer, Multitask and Meta-learning
Keywords: Distribution shift, Covariate shift, AUC, PU learning, Imbalanced data
Submission Number: 5248