A two-step anomaly detection based method for PU classification in imbalanced data sets

Carlos Ortega Vázquez; Seppe vanden Broucke; Jochen De Weerdt

Published: 01 Jan 2023, Last Modified: 05 Oct 2024. Venue: Data Min. Knowl. Discov. 2023. License: CC BY-SA 4.0

Abstract: Several machine learning applications, including genetics and fraud detection, suffer from incomplete label information. In such applications, a classifier can be trained only on positive and unlabeled (PU) examples, where the unlabeled data consist of both positive and negative examples. Although PU learning is well represented in the literature, few works have considered a class imbalance setting. Hence, we propose a novel two-step method that exploits anomaly detection to identify hidden positives within the unlabeled data. Our method allows the end-user to choose the anomaly detector depending on preference or domain knowledge. Moreover, we introduce Nearest-Neighbor Isolation Forest (NNIF), a novel semi-supervised anomaly detector based on the Isolation Forest. In contrast to unsupervised anomaly detectors, NNIF can utilize all available label information. Empirical analysis shows that our method, using NNIF as the anomaly detector, generally outperforms state-of-the-art PU learning methods for imbalanced data sets under different labeling mechanisms. Further experiments suggest that our two-step method is highly robust to incorrect class prior estimates.
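
To make the two-step idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the details of either step, so everything here is an assumption for illustration: a simple k-nearest-neighbor distance score stands in for the anomaly detector (NNIF is not reproduced), the hypothetical parameter `hidden_pos_fraction` plays the role of a class prior estimate, and a nearest-centroid rule stands in for the final classifier. The detector is passed as a function, reflecting the statement that the end-user can choose it.

```python
import math


def knn_anomaly_scores(points, k=3):
    """Stand-in detector: mean distance to the k nearest neighbours.

    Higher scores mean more isolated (more anomalous) points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def two_step_pu(labeled_pos, unlabeled, hidden_pos_fraction,
                detector=knn_anomaly_scores):
    """Illustrative two-step PU procedure (assumed, not from the paper).

    Step 1: score the unlabeled examples with an anomaly detector and treat
            the most anomalous fraction as likely hidden positives.
    Step 2: fit a simple classifier on (labeled + hidden positives) versus
            the remaining unlabeled examples, treated as negatives.
    """
    scores = detector(unlabeled)
    n_hidden = round(hidden_pos_fraction * len(unlabeled))
    ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
    hidden_idx = set(ranked[:n_hidden])

    positives = labeled_pos + [unlabeled[i] for i in hidden_idx]
    negatives = [u for i, u in enumerate(unlabeled) if i not in hidden_idx]

    pos_c, neg_c = centroid(positives), centroid(negatives)

    def predict(x):
        # Nearest-centroid rule: 1 = positive, 0 = negative.
        return 1 if math.dist(x, pos_c) < math.dist(x, neg_c) else 0

    return predict, hidden_idx


# Toy imbalanced data: 25 dense negatives and 3 scattered hidden positives.
toy_negatives = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
toy_hidden = [(5.0, 5.0), (6.0, 7.0), (7.0, 5.0)]
toy_unlabeled = toy_negatives + toy_hidden
toy_labeled_pos = [(6.0, 6.0), (5.5, 6.5)]

predict, hidden = two_step_pu(toy_labeled_pos, toy_unlabeled,
                              hidden_pos_fraction=0.1)
print(sorted(hidden))        # indices of unlabeled points flagged in step 1
print(predict((6.0, 5.5)))   # prediction for a point near the positives
```

In this toy setup the rare positives are isolated relative to the dense negative cluster, which is the situation in which an anomaly score can surface them. A detector that also used the labeled positives, as the abstract says NNIF does, would replace `knn_anomaly_scores` without changing the rest of the sketch.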
