Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

TMLR Paper 1798 Authors

07 Nov 2023 (modified: 19 Jan 2024) · Rejected by TMLR
Abstract: Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the \emph{selected completely at random} (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$, the probability of an individual unlabeled instance being positive, or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
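The density constraint underlying $\alpha$ estimation — that the unlabeled density must dominate $\alpha$ times the positive density, i.e., $\alpha f_p(x) \le f_u(x)$ for all $x$ — can be illustrated with a simplified one-dimensional sketch. This uses Gaussian KDEs and the plain minimum of the density ratio as an illustrative proxy, not the authors' actual PULSCAR objective; the data and all names here are hypothetical:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical 1-D data: positives ~ N(2,1); unlabeled is a mixture with
# true alpha = 0.3 positives and 0.7 negatives ~ N(-1,1).
pos = rng.normal(2.0, 1.0, 2000)
unl = np.concatenate([rng.normal(2.0, 1.0, 600), rng.normal(-1.0, 1.0, 1400)])

f_p = gaussian_kde(pos)
f_u = gaussian_kde(unl)

# Since alpha * f_p(x) <= f_u(x) everywhere, the ratio f_u(x) / f_p(x)
# upper-bounds alpha; its minimum over a grid with appreciable positive
# mass gives an estimate of alpha.
grid = np.linspace(0.0, 5.0, 200)
alpha_hat = float(np.min(f_u(grid) / f_p(grid)))
print(round(alpha_hat, 2))  # close to the true mixing proportion 0.3
```

In practice the ratio minimum is sensitive to density-estimation noise, particularly in low-density regions, which is one reason real estimators need a more robust objective than this conceptual upper bound.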
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=XCAQwD4iIU
Changes Since Last Submission: After our original TMLR submission on July 12, 2023, we carefully addressed the thorough feedback from reviewers and submitted an updated manuscript reflecting these revisions on August 30. Regrettably, the revisions, together with a comprehensive list of changes, appear not to have been fully taken into account during the latest review cycle: the rejection notice highlights concerns that had already been resolved. Per discussion with the editor-in-chief, we are submitting our revised manuscript for review, shortened from the last revision to fit the 12-page format. We made the following revisions:

1. Clarified notation:
   - In section 1, we changed the notation for the classifier in line 3 from $f: x \rightarrow y$ to $f: \mathcal{X} \rightarrow \{0, 1\}$.
   - Inconsistent use of the symbols $(x, y, s)$ was fixed in section 3.1.
   - In section 3.4, we removed the first line of equation 5, which showed the weighted sum of $\alpha$.
   - In section 3.4.2, we modified the sentence to state, "We iterate n_components over $1 \ldots m$"; earlier it read $1 \ldots 25$.
   - In equation 4, we changed the symbol for the function from $f$ to $h$.
2. Clarified assumptions and one-sample vs. two-sample scenarios: we added section 3.1.1 to explain the positive unlabeled (PU) data assumptions our PULSCAR and PULSNAR algorithms make to handle one-sample and two-sample scenarios.
3. Highlighted only major contributions: in section 1, we shortened our list of contributions to the three major ones.
4. Related work (section 2) was shortened and reorganized for clarity, as requested by a reviewer.
5. Clarified the relationships between the probability density functions of the positives and unlabeled and $\alpha$: we updated section 3.2 to fix the error concerning the relationship between $f_p(x)$, $f_u(x)$, and $\alpha$. We showed why $\alpha f_p(x) \le f_u(x)$ holds and explained the intuition behind the choice of the objective function.
6. Handling of class imbalance was further explained in section 4.
7. We included an appendix giving a better view of the magnitude and direction of the errors in the Figure 3 $\alpha$ estimates for synthetic selected completely/not at random (SCAR/SNAR) data.
8. We explained why our methods perform better than others, and their limitations:
   - We updated section 6 to add more detail to the discussion and conclusion, explaining why our methods work better than other PU methods.
   - We added a new section (section 7) for limitations.
9. We addressed concerns about how the number of clusters is determined and about our approach to clustering:
   - In section 3.4.2, we stated that PULSNAR determines the number of clusters with the published "knee point detection in BIC" approach [1]; since the method was adopted from [1], we cited it rather than restating it in our paper. Regarding the concern about iterating over $1 \ldots 25$ to determine the number of clusters, we responded to the reviewer that 25 is not a hard-coded value; our implementation of PULSNAR allows it to be set to any value, and we have modified section 3.4.2 to clarify this. Concerns that iterating could be slow are noted, but our algorithm ran faster than all others tested, several of which could not be used for our larger test cases.
   - In section 3.4.1, we explained why clustering converts a SNAR problem into multiple more SCAR-like problems.
10. We addressed an issue, raised only in the rejection, that the advantages of the proposed method were not mentioned. The advantages of the PULSNAR method are described in the "Introduction" section: in particular, we state that we can estimate the proportion of positive instances in SCAR and SNAR scenarios, enhancing classification accuracy and calibrating probabilities in PU learning, even with imbalanced datasets and a small fraction of positive examples. In the "Results" section, we demonstrated that when the SCAR assumption fails, PU methods based on that assumption either overestimate or underestimate the proportion of positives among unlabeled examples; on such datasets, PULSNAR outperformed all tested state-of-the-art PU methods for $\alpha$ estimation, clearly showing its advantages.

[1] Qinpei Zhao, Ville Hautamäki, and Pasi Fränti. Knee point detection in BIC for detecting the number of clusters. In International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 664–673. Springer, 2008.
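The cluster-count selection described in item 9 can be sketched as follows: fit mixture models over a range of component counts, record the BIC of each, and pick the knee of the BIC curve. This sketch uses a simple second-difference proxy for the knee rather than the refined angle-based criterion of Zhao et al. [1], and the data are hypothetical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical positives drawn from 3 well-separated clusters in 2-D.
X = np.vstack([rng.normal(c, 0.3, (200, 2)) for c in (-4.0, 0.0, 4.0)])

# Fit GMMs for n_components = 1..m and record BIC (lower is better).
m = 8
bic = [GaussianMixture(k, random_state=0).fit(X).bic(X) for k in range(1, m + 1)]

# Knee proxy: largest second difference of the BIC curve, i.e., where the
# curve bends most sharply from steep improvement to a plateau.
d2 = np.diff(bic, 2)             # second differences, covering k = 2..m-1
k_knee = int(np.argmax(d2)) + 2  # offset: d2[0] corresponds to k = 2
print(k_knee)  # expected to recover the 3 planted clusters
```

The upper limit `m` plays the role of the configurable bound discussed in item 9: it caps the search range but is not a hard-coded constant.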
Assigned Action Editor: ~Qibin_Zhao1
Submission Number: 1798