Estimating Expected Calibration Error for Positive-Unlabeled Learning

TMLR Paper5463 Authors

24 Jul 2025 (modified: 14 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: The reliability of probabilistic classifiers hinges on their calibration: the property that their confidence scores accurately reflect the true class probabilities. The expected calibration error (ECE) is a standard metric for quantifying the calibration of classifiers. However, its estimation presumes access to ground-truth labels. In positive-unlabeled (PU) learning, only positive and unlabeled data are available, which makes the standard ECE estimator inapplicable. Although PU learning has been extensively studied for risk estimation and classifier training, calibration in this setting has received little attention. In this paper, we present PU-ECE, the first ECE estimator for PU data. We provide non-asymptotic bias bounds and prove convergence rates that match those of the fully supervised ECE with an optimal bin size. Furthermore, we develop an information-theoretic generalization error analysis of PU-ECE by formalizing the conditional mutual information (CMI) for the PU setting. Experiments on synthetic and real-world benchmark datasets validate our theoretical analysis and demonstrate that our PU-based ECE estimator achieves performance comparable to that of the fully supervised ECE estimator.
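For context, the standard binned ECE estimator that the abstract contrasts with can be sketched as follows. This is a minimal illustration of the fully supervised estimator, which requires ground-truth labels; the function name `ece_binned` and the equal-width binning with 10 bins are illustrative choices, and the PU-ECE estimator proposed in the paper is not reproduced here.

```python
def ece_binned(confidences, labels, n_bins=10):
    """Standard binned ECE: sum over bins of (n_b / n) * |acc_b - conf_b|,
    using equal-width confidence bins. Requires ground-truth labels,
    which is exactly the assumption that fails in the PU setting."""
    n = len(confidences)
    totals = [0] * n_bins       # number of samples per bin
    label_sum = [0.0] * n_bins  # sum of labels (0/1) per bin
    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    for p, y in zip(confidences, labels):
        b = min(int(p * n_bins), n_bins - 1)  # confidence 1.0 falls in last bin
        totals[b] += 1
        label_sum[b] += y
        conf_sum[b] += p
    return sum(
        totals[b] / n * abs(label_sum[b] / totals[b] - conf_sum[b] / totals[b])
        for b in range(n_bins)
        if totals[b] > 0
    )
```

For example, two predictions at confidence 0.9 with one correct label give bin accuracy 0.5 against mean confidence 0.9, so the estimate is 0.4.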
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Addressed the comments by Reviewer k7h8.

**Section 3**
- Clarified the computational complexity of PU-ECE as $\mathcal{O}(n_\mathrm{U} + n_\mathrm{P})$, the same as standard ECE (Section 3.1).

**Section 5**
- Added Section 5.3, analyzing the effect of prior estimation error on PU-ECE robustness.
- Introduced experiments on the DDI dataset [Herrero-Zazo et al., 2013] to demonstrate real-world applicability (Sections 5.2 and 5.3).

**Appendices**
- Added Appendix G, a comparison of PU and supervised reliability diagrams on both the synthetic and benchmark datasets.
- Added Appendix I, an empirical comparison between UMB and UWB binning strategies.
Assigned Action Editor: ~Bruno_Loureiro1
Submission Number: 5463