Learning from biased positive-unlabeled data via threshold calibration

Published: 22 Jan 2025 | Last Modified: 06 Mar 2025 | AISTATS 2025 Oral | CC BY 4.0
Abstract: Learning from positive and unlabeled data (PU learning) aims to train a binary classification model when only positive and unlabeled examples are available. Typically, learners assume that there is a labeling mechanism that determines which positive labels are observed. A particularly challenging setting arises when the observed positive labels are a biased sample from the positive distribution. Current approaches either require estimating the propensity scores, which are the instance-specific probabilities that a positive example's label will be observed, or make overly restrictive assumptions about the labeling mechanism. We make a novel assumption about the labeling mechanism that we show is more general than several commonly used existing ones. Moreover, the combination of our novel assumption and theoretical results from robust statistics simplifies the process of learning from biased PU data. Empirically, our approach offers superior predictive and runtime performance compared to state-of-the-art methods.
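To make the setting concrete, below is a minimal, hypothetical sketch of how biased PU data arises from instance-specific propensity scores e(x) = P(label observed | y = 1, x). It is not the paper's method: the synthetic data, the logistic propensity on the first feature, and the naive baseline are all illustrative assumptions, intended only to show why a classifier trained on observed-positive vs. unlabeled data ends up with a miscalibrated decision threshold under a biased labeling mechanism.

```python
# Illustrative sketch of biased PU data (not the paper's algorithm).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fully labeled ground truth (unavailable to the learner in practice).
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Biased labeling mechanism (assumed for illustration): positives with a
# larger first feature are more likely to have their label observed, so the
# "selected completely at random" assumption does NOT hold.
propensity = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
s = (y == 1) & (rng.random(len(y)) < propensity)  # s = 1: observed positive

# The learner only sees X and s (observed positives vs. unlabeled).
# A naive baseline treats all unlabeled examples as negatives; with biased
# labels its scores and decision threshold no longer reflect P(y = 1 | x).
naive = LogisticRegression(max_iter=1000).fit(X, s.astype(int))
print("naive accuracy against the true labels:", naive.score(X, y))
```

In this toy setup, correcting the naive model would require either estimating the propensity scores or, as the abstract suggests, recalibrating the decision threshold under a suitable assumption on the labeling mechanism.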
Submission Number: 779