Keywords: Multilabel classification, Conformal prediction, Uncertainty estimation, Ensembles, Calibration, Reliable AI
Abstract: In high-stakes domains, predictions must not only be accurate; critical cases must also not be missed. Conformal prediction (CP), which offers distribution-free coverage guarantees, helps in this direction, but it often produces unstable or overly large prediction sets. Multilabel classification (MLC) further increases the challenge, because the model predicts multiple labels per instance in a typically large and imbalanced label space. The increased uncertainty, compared to binary or multiclass tasks, motivates the following research question: _how can we obtain smaller, more informative prediction sets from trained MLC models while preserving marginal coverage and maintaining the theoretical guarantees of CP?_ To address this question, we investigate ensembling, which can improve stability and efficiency but whose potential in MLC has not been fully explored. We conduct a systematic empirical study across standard MLC benchmarks (COCO, Yeast, Emotions), building ensembles under (i) majority voting, (ii) calibrated aggregation of nonconformity scores, and (iii) performance-weighted aggregation. We find that ensembles in all three categories consistently improve over single-model CP, yielding more efficient (smaller and more informative) prediction sets while maintaining target coverage and achieving higher macro-F1 scores. Majority-voting ensembles, moreover, also satisfy theoretical coverage lower bounds.
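To make the majority-voting variant concrete, here is a minimal sketch of split conformal prediction for MLC with majority-voted ensemble sets. All function names (`conformal_threshold`, `prediction_set`, `majority_vote_sets`), the per-label nonconformity score s = 1 − p, and the synthetic data are illustrative assumptions, not the paper's actual implementation; a known result for majority-voted conformal sets is a coverage lower bound of roughly 1 − 2α when each member runs at level α, which is the kind of guarantee alluded to above.

```python
import numpy as np

def conformal_threshold(probs_cal, labels_cal, alpha):
    """Per-label nonconformity threshold from a held-out calibration split.

    probs_cal:  (n_cal, n_labels) predicted label probabilities
    labels_cal: (n_cal, n_labels) binary ground-truth label matrix
    alpha:      target miscoverage level
    """
    # Nonconformity s = 1 - p for every relevant (instance, label) pair.
    scores = (1.0 - probs_cal)[labels_cal.astype(bool)]
    n = scores.size
    # Finite-sample-corrected (1 - alpha) quantile used in split CP.
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return np.quantile(scores, level)

def prediction_set(probs_test, q):
    """Include a label whenever its nonconformity 1 - p is at most q."""
    return (1.0 - probs_test) <= q

def majority_vote_sets(member_sets):
    """Keep a label iff more than half of the ensemble members include it.

    member_sets: list of (n_test, n_labels) boolean arrays, one per member.
    Running each member at level alpha, the voted set retains coverage of
    at least about 1 - 2*alpha (majority-vote CP lower bound).
    """
    votes = np.mean(np.stack(member_sets).astype(float), axis=0)
    return votes > 0.5

# Illustrative usage with three hypothetical ensemble members, each seeing
# a perturbed copy of the same probabilities (standing in for real models).
rng = np.random.default_rng(0)
probs_cal = rng.uniform(size=(200, 5))
labels_cal = (rng.uniform(size=(200, 5)) < 0.3).astype(int)
probs_test = rng.uniform(size=(10, 5))

alpha = 0.1
member_sets = []
for _ in range(3):
    cal_m = np.clip(probs_cal + rng.normal(0, 0.05, probs_cal.shape), 0, 1)
    test_m = np.clip(probs_test + rng.normal(0, 0.05, probs_test.shape), 0, 1)
    q = conformal_threshold(cal_m, labels_cal, alpha)
    member_sets.append(prediction_set(test_m, q))

voted = majority_vote_sets(member_sets)  # (10, 5) boolean prediction sets
```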
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 17776