Conformal prediction under ambiguous ground truth

David Stutz; Abhijit Guha Roy; Tatiana Matejovicova; Patricia Strachan; Ali Taylan Cemgil; Arnaud Doucet

Published: 26 Oct 2023. Last Modified: 17 Sept 2024. Accepted by TMLR. CC BY 4.0.

Abstract: Conformal Prediction (CP) enables rigorous uncertainty quantification by constructing a prediction set $C(X)$ satisfying $\mathbb{P}(Y \in C(X))\geq 1-\alpha$ for a user-chosen $\alpha \in [0,1]$, relying on calibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\mathbb{P}=\mathbb{P}^{X} \otimes \mathbb{P}^{Y|X}$. It is typically implicitly assumed that $\mathbb{P}^{Y|X}$ is the ``true'' posterior label distribution. However, in many real-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating expert opinions using a voting procedure, resulting in a one-hot distribution $\mathbb{P}_{\textup{vote}}^{Y|X}$. This is the case for most datasets, even well-known ones like ImageNet. For such ``voted'' labels, CP guarantees therefore hold w.r.t. $\mathbb{P}_{\textup{vote}}=\mathbb{P}^X \otimes \mathbb{P}_{\textup{vote}}^{Y|X}$ rather than the true distribution $\mathbb{P}$. In cases with unambiguous ground truth labels, the distinction between $\mathbb{P}_{\textup{vote}}$ and $\mathbb{P}$ is irrelevant. However, when experts disagree because the labels are ambiguous, approximating $\mathbb{P}^{Y|X}$ with a one-hot distribution $\mathbb{P}_{\textup{vote}}^{Y|X}$ ignores this uncertainty. In this paper, we propose to leverage expert opinions to approximate $\mathbb{P}^{Y|X}$ using a non-degenerate distribution $\mathbb{P}_{\textup{agg}}^{Y|X}$. We then develop \emph{Monte Carlo CP} procedures which provide guarantees w.r.t. $\mathbb{P}_{\textup{agg}}=\mathbb{P}^X \otimes \mathbb{P}_{\textup{agg}}^{Y|X}$ by sampling multiple synthetic pseudo-labels from $\mathbb{P}_{\textup{agg}}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. $\mathbb{P}_{\textup{vote}}$ under-covers expert annotations: calibrated for $72\%$ coverage, it falls short by $10\%$ on average; our Monte Carlo CP closes this gap both empirically and theoretically. We also extend Monte Carlo CP to multi-label classification and to CP with calibration examples enriched through data augmentation.
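
The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's reference implementation: the function names, the use of model probabilities as conformity scores, the `plausibilities` array standing in for $\mathbb{P}_{\textup{agg}}^{Y|X}$, and in particular the rank used to pick the threshold from the pooled scores are all assumptions made for illustration. The paper's Algorithm 1 should be consulted for the exact calibration rule and its guarantee.

```python
import numpy as np


def monte_carlo_calibrate(probs, plausibilities, alpha, m, rng):
    """Calibrate a score threshold from sampled pseudo-labels (illustrative sketch).

    probs:          (n, K) model probabilities on the calibration examples.
    plausibilities: (n, K) rows of an aggregated label distribution
                    (hypothetical stand-in for P_agg(y | x); rows sum to 1).
    alpha:          target miscoverage level.
    m:              number of pseudo-labels sampled per calibration example.
    """
    n, num_classes = probs.shape
    # Draw m pseudo-labels per calibration example from its aggregated distribution.
    labels = np.stack(
        [rng.choice(num_classes, size=m, p=p) for p in plausibilities]
    )  # shape (n, m)
    # Conformity score of each (example, pseudo-label) pair: the model probability.
    scores = np.sort(np.take_along_axis(probs, labels, axis=1).ravel())
    # ASSUMPTION: rank of the pooled m*n scores used as the threshold.
    # This finite-sample correction is part of the sketch, not a quoted formula.
    k = min(int(np.floor(alpha * m * (n + 1))) - m + 1, m * n)
    if k < 1:
        # Too few calibration scores for this alpha: include every class.
        return -np.inf
    return scores[k - 1]


def prediction_set(test_probs, tau):
    """Return the class indices whose score reaches the calibrated threshold."""
    return np.flatnonzero(test_probs >= tau)
```

Given calibration arrays `probs` and `plausibilities`, one would call `tau = monte_carlo_calibrate(probs, plausibilities, alpha=0.1, m=10, rng=np.random.default_rng(0))` and then `prediction_set(test_probs, tau)` for each test example. Pooling all sampled scores keeps the procedure close to standard split conformal prediction while letting ambiguous examples contribute several plausible labels instead of a single voted one.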

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission:
* The abstract has been completely rewritten.
* The introduction (Section 1) has been completely rewritten.
* Section 2 has been reorganized: conformal prediction is now introduced before the toy example (compared to the previous version, Sections 2.1 and 2.2 have been swapped and partially rewritten).
* Section 3.1 has been significantly expanded to give common examples of how the aggregation model $\mathbb{P}_{\textup{agg}}^{Y|X}$ can be designed. We have also clarified the terminology by formally introducing aggregated coverage (previously expected coverage) and voted coverage.
* The previous Sections 3.2 and 3.3, which discussed strategies for performing CP directly on the plausibilities, have been removed, and the manuscript now focuses entirely on Monte Carlo conformal prediction.
* The new Section 3.2 (previously Section 3.4), which introduces Algorithm 1, has been reorganized and partially rewritten. Section 3.3 is what used to be Section 3.4.1, and Section 3.4, which introduces Algorithm 2, is what used to be Section 3.4.2. Both have been slightly edited.
* We have added a new Section 3.5 and a corresponding Table 1, which summarize the theoretical and empirical properties of these algorithms.
* The extension of Monte Carlo CP to multi-label classification and data augmentation is now in Section 3.6 (previously Section 3.5.1), and we have shortened and clarified the techniques.
* The application section, Section 4, has been streamlined (e.g., the previous Figure 13 has been removed), and we have edited the section to clarify our results.
* We have completely rewritten the final section, Section 5, and added a broader impact subsection at the suggestion of a reviewer.
* We have added a new Appendix A, which explains how labels were obtained for many standard ML datasets.

Supplementary Material: pdf

Assigned Action Editor: ~Fredrik_Daniel_Johansson1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 1399
