Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

TMLR Paper 1371 Authors

12 Jul 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the \emph{selected completely at random} (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion $\alpha$ of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$, the probability that an individual unlabeled instance is positive, or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
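To make the class-prior estimation problem described in the abstract concrete, here is a minimal, hypothetical sketch in Python. It is not the paper's PULSCAR or PULSNAR method; as a stand-in it applies the classical Elkan & Noto (2008) SCAR estimator to synthetic one-dimensional data, and all distributions, sample sizes, and the labeling propensity `c_true` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic single-training-set PU data: positives ~ N(1,1), negatives ~ N(-1,1).
# Positives are labeled completely at random with propensity c_true (SCAR).
X = np.vstack([rng.normal(1.0, 1.0, size=(800, 1)),    # positives (y = 1)
               rng.normal(-1.0, 1.0, size=(700, 1))])  # negatives (y = 0)
y = np.concatenate([np.ones(800), np.zeros(700)])
c_true = 0.625
s = (y == 1) & (rng.random(1500) < c_true)             # s = 1 iff labeled

# True alpha: fraction of positives among the unlabeled examples.
alpha_true = y[~s].mean()

# Non-traditional classifier g(x) ~ P(s = 1 | x), labeled vs. unlabeled.
X_tr, X_ho, s_tr, s_ho = train_test_split(
    X, s, test_size=0.3, random_state=0, stratify=s)
g = LogisticRegression().fit(X_tr, s_tr)

# Under SCAR, P(s = 1 | y = 1, x) = c is constant, so c is estimated as the
# mean score of held-out labeled positives (Elkan & Noto's estimator).
c_hat = g.predict_proba(X_ho[s_ho])[:, 1].mean()

# P(y = 1 | x, s = 0) = ((1 - c) / c) * g(x) / (1 - g(x)); averaging this
# over the unlabeled set gives an estimate of alpha.
p_unl = g.predict_proba(X[~s])[:, 1]
alpha_hat = ((1 - c_hat) / c_hat * p_unl / (1 - p_unl)).mean()
print(f"true alpha ~ {alpha_true:.2f}, estimated alpha ~ {alpha_hat:.2f}")
```

When the SCAR assumption fails (e.g., severe cases labeled more often), the estimated `c_hat` no longer reflects a constant propensity and this estimate of $\alpha$ degrades, which is the failure mode PULSNAR's divide-and-conquer over SCAR-like sub-problems is designed to address.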
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: As suggested by reviewers, we have made the following revisions to the paper:
1. In section 1, we modified the notation for the classifier in line 3 from $f: x \rightarrow y$ to $f: \mathcal{X} \rightarrow \{0, 1\}$.
2. In section 1, we updated our contributions, keeping the three major ones.
3. Related work (section 2) was revised as suggested.
4. Inconsistent use of the symbols $(x, y, s)$ was fixed in section 3.1.
5. We added section 3.1.1 to explain the PU data assumptions our PULSCAR and PULSNAR algorithms make to handle the one-sample and two-sample scenarios.
6. We updated section 3.2 to fix the error in the relationship between $f_p(x)$, $f_u(x)$, and $\alpha$: we showed why $\alpha f_p(x) \leq f_u(x)$ holds and explained the intuition behind the choice of objective function (see the sketch following this list).
7. In equation 4, we changed the symbol for the function from $f$ to $h$.
8. In section 3.4, we removed the first line of equation 5, which showed the weighted sum of $\alpha$.
9. In section 4, we explained how we handled class imbalance in our experiments.
10. We added a new figure (Figure 4) showing the difference between the estimated and true $\alpha$, and widened Figure 3 to make it clearer.
11. We expanded the discussion and conclusion (section 6) to explain why our methods work better than other PU methods.
12. We added a new section (section 7) on limitations.
Note: the requested additions slightly lengthened the manuscript beyond the originally-reviewed 12 pages.
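As a hedged illustration of the bound mentioned in item 6: since the unlabeled density decomposes as $f_u(x) = \alpha f_p(x) + (1-\alpha) f_n(x)$ with $f_n(x) \geq 0$, it follows that $\alpha f_p(x) \leq f_u(x)$ everywhere, so $\min_x f_u(x)/f_p(x)$ upper-bounds $\alpha$. The KDE-based sketch below illustrates this intuition on synthetic data; the bandwidths, grid, and quantile cutoffs are assumptions for illustration, not the paper's actual objective function.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
alpha_true = 0.3

# Two-sample PU data: a positive sample from f_p and an unlabeled sample
# from f_u = alpha*f_p + (1 - alpha)*f_n.
x_pos = rng.normal(1.0, 1.0, 2000)                      # labeled positives
x_unl = np.concatenate([rng.normal(1.0, 1.0, 600),      # hidden positives
                        rng.normal(-1.0, 1.0, 1400)])   # negatives

f_p = gaussian_kde(x_pos)   # density estimate of the positive sample
f_u = gaussian_kde(x_unl)   # density estimate of the unlabeled sample

# Evaluate the ratio on a grid restricted to where f_p has real support,
# since the bound is only informative where f_p(x) is not vanishingly small.
grid = np.linspace(np.quantile(x_pos, 0.05), np.quantile(x_pos, 0.95), 200)
alpha_bound = np.min(f_u(grid) / f_p(grid))
print(f"true alpha = {alpha_true}, KDE upper bound ~ {alpha_bound:.2f}")
```

The bound is tight wherever the negative density $f_n(x)$ vanishes on the support of $f_p$, which is why restricting the grid to high-$f_p$ regions tends to recover a value close to the true $\alpha$ in this toy example.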
Assigned Action Editor: ~Takafumi_Kanamori1
Submission Number: 1371