\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

%%%%%%%%%%%%%%%%%%%%%%
\usepackage{url}
\usepackage{float} 
\usepackage{todonotes}
\usepackage{booktabs}

\usepackage{appendix} 
\usepackage{pifont}
\newcommand{\cmark}{\textcolor{green}{\ding{51}}} % ok
\newcommand{\xmark}{\textcolor{red}{\ding{55}}}   % cross
\usepackage{wrapfig}
\usepackage{adjustbox}
\usepackage{bm} 
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{pifont}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{paralist}
\usepackage[para]{footmisc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newtheorem{assumption}[theorem]{Assumption}


% \theoremstyle{remark}
% \newtheorem{remark}[theorem]{Remark}
\usepackage{multirow}
% \usepackage{subfig}
\usepackage{enumitem}
\usepackage{colortbl}
\usepackage[most]{tcolorbox}
\newcommand{\care}{\textsc{Care}}

\usepackage{wrapfig}
\usepackage[normalem]{ulem} 
\usepackage{soul}
\usepackage{xcolor}
\usepackage[most]{tcolorbox}
\usepackage{xcolor}
\usepackage{caption} 
\usepackage{cancel}
%%% for rebuttal %%%
\newtcolorbox{highlight}{
  % colback=yellow,    
  % colframe=yellow,     
  coltext=red,       
  sharp corners,     
  boxrule=0pt,       
  breakable,          
  enhanced,
  %%% width %%%
  left=2pt,        
  right=2pt,       
  top=5pt,         
  bottom=5pt,      
  boxsep=0pt,      
  leftrule=0pt,    
  rightrule=0pt,
  width=\textwidth,  
  % -----------------------
  before upper={
    \captionsetup{labelfont={color=red}, textfont={color=red}}
  }
}
\newcommand{\tbd}[1]{\textcolor{red}{#1}}
\newcommand{\task}[1]{\textcolor{deepgreen}{\textbf{\boldmath #1}}}
\usepackage{todonotes}

\newcolumntype{Y}{>{\centering\arraybackslash}X}

\usepackage[table]{xcolor}
\usepackage{colortbl}
\definecolor{lightblue}{rgb}{0.85, 0.92, 0.96}
\definecolor{deepgreen}{RGB}{0,100,0} 
\definecolor{peach}{RGB}{255, 180, 180}
\definecolor{orangee}{RGB}{255, 220, 180}
\definecolor{lightgray}{gray}{0.9}
\definecolor{midgray}{gray}{0.8}
\definecolor{lightgreen}{rgb}{0.88, 0.95, 0.88}
\definecolor{group1}{RGB}{255,240,238}
\definecolor{group2}{RGB}{245,255,250}  % Mint Cream
\definecolor{group3}{RGB}{255,250,240}  % Papaya Whip (more distinct warm tone)
\definecolor{group4}{RGB}{250,240,255}  % Lavender Blush
\definecolor{group5}{RGB}{240,248,255} % Alice Blue
\usepackage{tikz}
\usetikzlibrary{calc}
\newcommand*\circled[1]{\tikz[baseline=(char.base)]{
            \node[shape=circle,draw,inner sep=.6pt] (char) {#1};}}
\usepackage{hyperref}       % hyperlinks


%%%%%%%%%%%%%%%%%%%%%%

\usepackage{mwe} % to get dummy images
% \jmlrvolume{-- Under Review}
% \jmlrworkshop{Full Paper -- MIDL 2026 submission}
% \editors{Under Review for MIDL 2026}
\jmlrvolume{-- Published}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\title[CARE: \underline{C}onfidence-\underline{A}ware \underline{R}atio \underline{E}stimation for Medical Biomarkers]{CARE: \underline{C}onfidence-\underline{A}ware \underline{R}atio \underline{E}stimation \\ for Medical Biomarkers}

\midlauthor{
\Name{Jiameng Li\nametag{$^{1}$}} \Email{jiameng.li@kuleuven.be}\\
\Name{Teodora Popordanoska\nametag{$^{1}$}} \Email{teodora.popordanoska@kuleuven.be}\\
\Name{Aleksei Tiulpin\nametag{$^{2,3}$}} \Email{alt4026@med.cornell.edu}\\
\Name{Sebastian G. Gruber\nametag{$^{1}$}} \Email{sebastian.gruber@kuleuven.be}\\
\Name{Frederik Maes\nametag{$^{1}$}} \Email{frederik.maes@kuleuven.be}\\
\Name{Matthew B. Blaschko\nametag{$^{1}$}} \Email{matthew.blaschko@kuleuven.be}\\[4pt]
\addr $^{1}$ KU Leuven \qquad $^{2}$ University of Oulu  \qquad $^{3}$ Weill Cornell Medicine
}


\begin{document}

\maketitle

\begin{abstract}
Ratio-based biomarkers (RBBs), such as the proportion of necrotic tissue within a tumor, are widely used in clinical practice to support diagnosis, prognosis, and treatment planning. These biomarkers are typically estimated from segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified \textit{confidence-aware} framework for estimating ratio-based biomarkers. Our uncertainty analysis stems from two observations: (1) the probability ratio estimator inherently admits a statistical confidence interval regarding local randomness (bias and variance); (2) the segmentation network is not perfectly calibrated (calibration error).
We perform a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty.  Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of segmentation-derived RBBs in clinical workflows.
\textbf{Code: \url{https://github.com/renaissanceee/care}}
\end{abstract}

\begin{keywords}
Medical Imaging Analysis, Uncertainty Quantification, Trustworthy AI
\end{keywords}

\section{Introduction}
\label{sec:intro}


The success of deep learning in medical image analysis, particularly since the introduction of UNet architectures \citep{ronneberger2015u, isensee2021nnu}, has enabled automated segmentation of anatomical and pathological structures across a range of clinical imaging tasks. However, segmentation is rarely the end goal in clinical practice. Instead, it often serves as an intermediate step towards quantifying tissue biomarkers, such as volumes \citep{popordanoska2021relationship,rousseau2025post,kazerouni2023diffusion,abdusalomov2023brain} and fraction scores \citep{ronneberger2015u,isensee2021nnu, bahna2022tumor,kim2008tumor,solovyev2020bayesian}, that are used to assess disease progression, guide treatment decisions, or monitor therapeutic responses. This paper focuses on ratio-based biomarkers (RBBs), which are typically derived from two volume measurements computed from pixel-wise predictions. The naive computation of an RBB from a standard segmentation model provides no uncertainty quantification (UQ), which limits clinical adoption and undermines the value of the estimate as a reference for decision-making.
To address this, we study confidence-aware ratio estimation for RBBs.

\begin{figure}[!t]
\centering
\begin{minipage}[b]{0.57\linewidth}
\centering
    \includegraphics[width=\linewidth]{figure/fig2/fig2_a.pdf}
    \textbf{(a) Biomarkers}
    \label{subfig:fig2_biomarkers}
\end{minipage}
\begin{minipage}[b]{0.42\linewidth}
    \centering
\includegraphics[width=\linewidth]{figure/fig2/fig2_b_two_patient.pdf}
    \textbf{(b) Clinical support}
    \label{subfig:fig2_clinical_support}
\end{minipage}
\caption{Examples of ratio-based biomarkers and their roles in clinical support. (a): Ratio-based biomarkers \citep{baid2021rsna, myronenko2023automated} exist in many organs and modalities. (b): An illustrative example where the high-risk threshold is set to $0.25$; \care~calls for a human check when a confidence interval crosses the threshold.}
\label{fig:fig2}
\end{figure}

Fig.~\ref{fig:fig2} (a) shows examples of two clinically used RBBs: the necrosis-to-tumor ratio (NTR) and the tumor-to-kidney ratio (TKR). The NTR is mostly used in brain cancer treatment~\citep{henker2019volumetric, henker2017volumetric} to quantify the proportion of necrotic (non-viable) tissue within a tumor. The TKR~\citep{herts2002enhancement} indicates the extent of tumor infiltration within the kidney and is computed mainly from abdominal CT. A straightforward method for computing these ratios uses segmentation models to identify the subregion and the whole foreground region, and then calculates the ratio from the averaged softmax confidence scores over these regions. However, the interpretation of this point estimate can change once the confidence interval is considered. Following the example in Fig.~\ref{fig:fig2} (b),
consider a clinical threshold of 0.25 for initiating aggressive treatment. Based on point estimates alone, Patient~A would receive aggressive treatment (high ratio) while Patient~B would receive mild treatment (low ratio). However, if the associated confidence interval spans the decision threshold (case 2), the estimate is flagged for mandatory expert review to mitigate the risk of misdiagnosis. Such double-check procedures are essential in clinical practice, as they provide an additional safeguard for patients and enhance the robustness of downstream decision-making.

\begin{figure}[ht]
\centering
\includegraphics[width=\linewidth]{figure/fig1/Figure_1_sep_2.pdf}
\caption{
    {Overview.} In automated medical imaging analysis, biomarkers are often computed from network predictions. 
    To quantify the uncertainty of ratio-based biomarkers, we introduce \care, a confidence-aware estimation method providing reliable confidence intervals.}
\label{fig:fig1}
\end{figure}




Despite the clinical importance of quantifying uncertainty, most efforts continue to focus on improving the accuracy of the upstream segmentation \citep{ronneberger2015u,isensee2021nnu,hatamizadeh2021swin}. We propose \care, a framework for estimating confidence intervals for RBBs that is mathematically grounded and requires neither additional training nor sampling at test time. Our core contribution lies in identifying the sources of error and quantifying their individual impacts on the overall confidence interval (Fig.~\ref{fig:fig1}). Specifically, we establish a ratio estimator bound using Markov's inequality~\citep{resnick2003probability} and derive a squared error estimator from volume predictions. To quantify the error caused by miscalibration, we provide theoretical insights into the relationship between model calibration and ratio estimation and propose a miscalibration-based bound, building on recent advances in calibration error (CE) estimation \citep{guo2017calibration,popordanoska2022consistent} and a recently established connection between volume estimation bias and CE. In summary, our main contributions are:
\begin{enumerate}
    \item We propose \care, a principled framework for trustworthy estimation of ratio-based biomarkers in an automated estimation workflow with minimal assumptions.
    \item  We analyze the sources of error across the entire segmentation-to-biomarker pipeline and empirically demonstrate that miscalibration is the dominant factor.
    \item Experiments confirm that \care~effectively tracks the prediction uncertainty, as evidenced by its coverage of erroneous predictions and its ability to distinguish levels of segmentation difficulty. In addition, \care~yields tighter confidence intervals than other sound adaptive uncertainty quantification methods.
\end{enumerate}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preliminaries}
\label{sec:analysis}

We define the RBB as a ratio between volumes $V_\mathrm{A}$ and $V_\mathrm{B}$ \citep{henker2019volumetric,henker2017volumetric}.
The ratio is estimated via a standard segmentation framework,
% We consider the ratio estimation within a standard segmentation framework, 
where $V_\mathrm{A}$ and $V_\mathrm{B}$ are calculated from predicted probabilities.



\begin{definition}[Ratio from Segmentation Networks]
\label{def:NN_ratio}
Given per-pixel inputs $\{z_i\}_{i=1}^n$, labels $\{y_{A,i}, y_{B,i}\}_{i=1}^n$, and a segmentation model $g \colon z_i \mapsto \left(g_{A}(z_i), g_{B}(z_i)\right) \in [0,1]^2$,
the labeled ratio $r_{\mathrm{gt}}$ and the predicted ratio $\hat{r}$ over $n$ pixels are calculated as
\begin{equation}
    \begin{aligned}
    r_\mathrm{gt} = \frac{\bar{y}_A}{\bar{y}_B} = \frac{\sum_{i=1}^n y_{A,i}}{\sum_{i=1}^n y_{B,i}} \text{, and }
    \hat{r} = \frac{\bar{g}_A}{\bar{g}_B} = \frac{\sum_{i=1}^n g_A(z_i)}{\sum_{i=1}^n g_B(z_i)}.
    \end{aligned}
    \label{eq:ratio_NN}
\end{equation}
\end{definition}


\begin{definition}[Confidence Interval]
\label{def:ci}
Let $\hat r$ be an estimator of an unknown but fixed true value $r_{\mathrm{gt}}$, and let $\alpha \in (0,1)$ be a given significance level.
The $(1-\alpha)$ confidence interval for $r_{\mathrm{gt}}$ is an interval $[\epsilon_l, \epsilon_u]$ such that
\begin{equation}
\mathbb{P}\!\left( \epsilon_l \le r_{\mathrm{gt}} \le \epsilon_u \right) \ge 1 - \alpha .
\label{eq:ci}
\end{equation}
\end{definition}

\begin{definition}[Empirical Coverage Rate]
\label{def:coverage}
Let $\mathcal{D}_\mathrm{test} = \{(\hat{r}_i, r_{\mathrm{gt},i})\}_{i=1}^{n}$ be a test set of $n$ samples, where $\hat{r}_i$ and $r_{\mathrm{gt},i}$ are the prediction and the ground truth of sample $i$.
Suppose a UQ method provides a confidence interval $[\epsilon_{l,i}, \epsilon_{u,i}]$ for $\hat{r}_i$ at confidence level $1-\alpha$.
The empirical coverage rate measures the proportion of samples whose true values fall within the confidence interval:
\begin{equation}
\mathrm{Coverage}(\mathcal{D}_\mathrm{test}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left( \epsilon_{l,i} \le r_{\mathrm{gt},i} \le \epsilon_{u,i} \right) ,
\label{eq:coverage}
\end{equation}
where $\mathbf{1}(\cdot)$ is the indicator function, which equals 1 if the condition is true and 0 otherwise. Based on Def.~\ref{def:ci}, a UQ method with a coverage guarantee should satisfy $\mathrm{Coverage}(\mathcal{D}_\mathrm{test}) \ge 1 - \alpha$.
\end{definition}



One of the most straightforward frameworks to quantify uncertainty in $\hat{r}$ is conformal prediction (CP)~\citep{shafer2008tutorial}, a frequentist distribution-free method relying on minimal assumptions. In its basic form for regression, \textit{i.e.}, estimation of $r_\mathrm{gt}$, CP yields the following confidence interval with a theoretical guarantee.


\begin{proposition}[Conformalized Quantile Regression (CQR)]
\label{prop:CP}
\citep{romano2019conformalized} Given ground-truth $r_{\mathrm{gt}}$, prediction $\hat{r}$ and the absolute error residual  $e_r \coloneqq |r_{\mathrm{gt}} - \hat{r}|$, let $q_{e_r, \delta}$ denote the $\frac{n+1}{n}(1-\delta)$ quantile of the instance-wise $e_r$ on a validation set $\mathcal{D}_\text{val}$ of size $n$.
Then, with probability at least $1-\delta$,
\begin{align}
r_\mathrm{gt} \in \left[\hat{r} - q_{e_r, \delta}, \hat{r} + q_{e_r, \delta}\right].
\label{eq:bound_cp}
\end{align}
\label{prop:bound_cp}
\end{proposition}
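As a small illustration with hypothetical numbers (they are invented for exposition and are not taken from our experiments), assume that the empirical quantile is taken as the $\lceil (n+1)(1-\delta) \rceil$-th smallest validation residual. Consider $n=9$ sorted residuals $0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 0.15$ and $\delta = 0.32$. Then
\begin{align*}
\lceil (n+1)(1-\delta) \rceil = \lceil 10 \cdot 0.68 \rceil = 7,
\qquad
q_{e_r,\delta} = 0.08,
\end{align*}
so a test prediction $\hat{r} = 0.30$ receives the interval $[0.30 - 0.08,\; 0.30 + 0.08] = [0.22, 0.38]$. The half-width $0.08$ is shared by every test sample.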

The biggest challenge of na\"ive CQR is that it does not allow for adaptivity. For example, in the case of TKR or NTR, it is important to take the tumor size into account, as it is much harder to annotate small objects. We therefore consider the adaptive CQR. 

\begin{proposition}[Adaptive Conformalized Quantile Regression (ACQR)] \citep{angelopoulos2021gentle}
Let $u_r > 0$ be an uncertainty measure of $r$. The instance-wise conformity score of the residual term is defined as $s_r \coloneqq \frac{e_r}{u_r} = \frac{|r_{\mathrm{gt}} - \hat{r}|}{u_r}$.
Similar to Prop.~\ref{prop:bound_cp}, let $q_{s_r,\delta}$ denote the $\frac{n+1}{n}(1-\delta)$ quantile of $s_r$ on $\mathcal{D}_\text{val}$.
Then, with probability at least $1-\delta$,
\begin{align}
r_{\mathrm{gt}} \in \big[\hat r - u_r\,q_{s_r,\delta},\;\; \hat r + u_r\,q_{s_r,\delta}\big].
\label{eq:bound_acp}
\end{align}
When $u_r=1$, the score $s_r$ reduces to the residual $e_r$, \textit{i.e.}, ACQR reduces to CQR.
\label{prop:bound_acp}
\end{proposition}
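To illustrate the adaptivity, consider hypothetical values that are not taken from our experiments: a validation quantile $q_{s_r,\delta} = 0.2$ and two test samples with the same prediction $\hat{r} = 0.30$ but different uncertainty measures, $u_r = 0.1$ (an easy sample) and $u_r = 0.4$ (a hard sample). Eq.~\eqref{eq:bound_acp} then gives
\begin{align*}
u_r = 0.1: &\quad r_{\mathrm{gt}} \in [0.30 - 0.1 \cdot 0.2,\; 0.30 + 0.1 \cdot 0.2] = [0.28, 0.32], \\
u_r = 0.4: &\quad r_{\mathrm{gt}} \in [0.30 - 0.4 \cdot 0.2,\; 0.30 + 0.4 \cdot 0.2] = [0.22, 0.38].
\end{align*}
The quantile is shared by all test samples; only $u_r$ changes the width of the interval.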

Despite the mathematical guarantees of ACQR, the choice of the uncertainty measure $u_r$ remains non-trivial and requires domain expertise. This naturally limits its generalizability. In this paper, we follow the intuition that small tumors carry greater uncertainty and define the measure for a sample $x$, written $u(x)$, for RBBs in tumor cases as follows.
\begin{remark}[Uncertainty Measure in Tumors]
In tumor-related applications, uncertainty is often characterized by tumor size. Let $V_\mathrm{T}$ be the tumor volume for sample $x$, and let $V_\mathrm{T,max}$ be the maximum tumor size that can be measured in a particular application. We then define $u(x)$ as
\begin{align}
u(x)=\lambda\left(1-\frac{V_\mathrm{T}}{V_\mathrm{T,max}+\epsilon}\right),~\text{with }\lambda=\frac{1}{2q_{s_r,\delta}},
\label{eq:bound_cp_ada}
\end{align}
for the ACQR implementation; see the derivation in Sec.~\ref{subsec:proof_u} in the appendix.
\label{rmk:u_size}
\end{remark}

By Def.~\ref{def:NN_ratio}, the predicted ratio $\hat{r}$ is determined by the probability volumes predicted by the network.
Since the network is not perfectly calibrated, quantifying the uncertainty in its predictions is closely tied to assessing volume bias. 
\begin{definition}[Volume Bias (V-Bias)] \citep{popordanoska2021relationship}
    \label{def:bias_base}
    Given a segmentation model \( g \colon \mathcal{Z} \to [0,1] \) that predicts the probability of \( y \in \{0,1\} \), the volume bias is defined as:
    \begin{align}
        \operatorname{V-Bias} \left( g \right) & \coloneqq \mathbb{E}_{(z,y) \sim P} \left[g \left( z \right)-y \right].
        \label{eq:l1_bias}
    \end{align}
\end{definition}

There is a direct connection between the V-Bias and the residual used in CP. \citet{popordanoska2021relationship} showed that the absolute V-Bias is upper bounded by the calibration error, which is defined as follows.

\begin{definition}[Calibration Error (CE)]
\label{def:calibration_error}
\citep{kumar2019verified}
Given a model \( g \colon \mathcal{Z} \to [0,1] \) that predicts the probability of $y\in \{0,1\} $, the calibration error is defined as:
\begin{align}
    \operatorname{CE} (g) &\coloneqq \mathbb{E}_{(z,y) \sim P} \left[\left\lvert  g(z)-\mathbb{P} \left( y=1 \mid g \left( z \right) \right) \right\rvert \right].
    \label{eq:l1_ce}
\end{align}
\end{definition}

\citet{popordanoska2021relationship} state the relationship between CE and V-Bias as follows.
\begin{proposition}[The Relationship of V-Bias and CE]
\label{prop:relation_l1} 
\citep{popordanoska2021relationship}
Given a segmentation model \( g \colon \mathcal{Z} \to [0,1] \), the absolute value of the volume bias is upper bounded by the calibration error, \textit{i.e.}, 
$|\operatorname{V-Bias}(g)|\leq\operatorname{CE}(g)$.
\end{proposition}
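A short argument, sketched here in the notation of Def.~\ref{def:bias_base} and Def.~\ref{def:calibration_error} rather than quoted from the cited work, shows why the bound holds. For binary $y$, the conditional probability of $y=1$ given $g(z)$ equals $\mathbb{E}\left[ y \mid g(z) \right]$, so the tower property followed by Jensen's inequality gives
\begin{align*}
\left\lvert \operatorname{V-Bias}(g) \right\rvert
= \left\lvert \mathbb{E} \left[ g(z) - \mathbb{E} \left[ y \mid g(z) \right] \right] \right\rvert
\le \mathbb{E} \left[ \left\lvert g(z) - \mathbb{E} \left[ y \mid g(z) \right] \right\rvert \right]
= \operatorname{CE}(g).
\end{align*}
Equality requires the calibration gap $g(z) - \mathbb{E}\left[ y \mid g(z) \right]$ to have a constant sign almost surely; over- and under-confident regions that cancel in the volume still contribute to the CE.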
%%%%%%%%%%%%%%%%%%%%%%%%%%%

In the next section, we first derive a local RBB interval considering the bias and variance.
Then, we extend the concept of Conformalized Quantile Regression to V-Bias to derive a miscalibration RBB bound. To provide a comprehensive analysis, we additionally discuss CE as an alternative upper bound.




\section{\care: Confidence-aware Ratio Estimation}
\label{sec:two_bounds}

\paragraph{Overview.} In this section, we base our uncertainty analysis on two key observations.
The first observation is that the ratio estimator $\hat{r} = \tfrac{\bar{y}}{\bar{x}}$ is subject to instance-wise randomness, which we capture using statistical tools such as Markov’s inequality to derive an \underline{\textit{estimation-based interval}}.
The second observation is that the network is not perfectly calibrated, introducing a global, model-level error affecting both the numerator and denominator; this gives rise to the \underline{\textit{calibration-based interval}}.
Combining these two sources yields the overall uncertainty bound.


\paragraph{Estimation-based interval.}
An approximate theoretical result for ratio statistics is given by \citet{van2000mean}. However, the derivation critically relies on the assumption that the addends in $\bar{x}$ and $\bar{y}$ are independent.
% which corresponds to independence among pixels in our case. 
Therefore, the result in \citet{van2000mean} is not directly applicable to image analysis, where spatial patterns violate this independence assumption. As a remedy, we construct Markov bounds as an estimation-based confidence interval for $\hat{r}$ using Markov's inequality \citep{resnick2003probability}. Although this approach leads to more conservative bounds, it avoids strong assumptions such as pixel independence, making it more applicable to image data.


\begin{proposition}[Estimation-based Confidence Interval]
\label{prop:bound_local} 
Given an estimator $\hat{r}= \frac{\bar{y}}{\bar{x}}$ of the fraction $r=\frac{\mathbb{E} \left[ y \right]}{\mathbb{E} \left[ x \right]}$ with random variables $x$ and $y$, it holds with at least $1 - \alpha$ probability that
\begin{align}
r &\in \left[
  \hat{r} - \beta_{r,\alpha},\;
  \hat{r} + \beta_{r,\alpha}
\right],
\label{eq:bound_beta}
\end{align}
where $\beta_{r,\alpha} \coloneqq \frac{\sqrt{\mathrm{SE_{\hat{r}}}}}{\sqrt{\alpha}}$ is the half-width of the bound, and $\mathrm{SE_{\hat{r}}} \coloneqq \mathbb{E} \left[ \left( \hat{r} - r \right)^2 \right]$ is the expected squared error.
\end{proposition}
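The bound follows from Markov's inequality applied to the non-negative random variable $\left( \hat{r} - r \right)^2$. The step is sketched below, followed by a numerical illustration whose values are hypothetical and not taken from our experiments:
\begin{align*}
\mathbb{P} \left( \left\lvert \hat{r} - r \right\rvert \ge \beta_{r,\alpha} \right)
= \mathbb{P} \left( \left( \hat{r} - r \right)^2 \ge \beta_{r,\alpha}^2 \right)
\le \frac{\mathbb{E} \left[ \left( \hat{r} - r \right)^2 \right]}{\beta_{r,\alpha}^2}
= \alpha .
\end{align*}
For instance, an expected squared error of $10^{-4}$ and $\alpha = 0.16$ give $\beta_{r,\alpha} = 0.01 / 0.4 = 0.025$, so $\hat{r} = 0.30$ yields the interval $[0.275, 0.325]$. Because $\beta_{r,\alpha}$ scales with $\alpha^{-1/2}$, halving $\alpha$ widens the interval by a factor of $\sqrt{2}$.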

We then apply a Taylor expansion to $\mathrm{SE_{\hat{r}}}$ to obtain an approximation that can be estimated in practice.


\begin{proposition}
    Assume all central moments of the independently and identically distributed random variables $\left(x_1, y_1 \right), \dots, \left(x_n, y_n \right) \sim \mathbb{P}_{xy}$ in the estimator $\hat{r} = \frac{\bar{y}}{\bar{x}}$ exist, and write $\mu_x \coloneqq \mathbb{E} \left[ x \right]$ and $\mu_y \coloneqq \mathbb{E} \left[ y \right]$. Then we have
        \begin{equation}
        \mathrm{SE_{\hat{r}}} = \frac{1}{n} \left( \frac{\operatorname{Var} \left( y \right)}{\mu_x^2} + \operatorname{Var} \left( x \right) \frac{\mu_y^2}{ \mu_x^4} - 2 \operatorname{Cov} \left( x, y \right)  \frac{\mu_y}{ \mu_x^3} \right) + O \left( \frac{1}{n^2} \right).
    \label{eq:beta_taylor_ind}
    \end{equation}
\end{proposition}
The corresponding plug-in estimator is
\begin{equation}
        \widehat{\mathrm{SE_{\hat{r}}}} \coloneqq \frac{1}{n} \left( \frac{\hat{\sigma}_y^2}{\bar{x}^2} + \frac{\hat{\sigma}_x^2 \bar{y}^2}{ \bar{x}^4} - 2 \frac{\hat{\sigma}_{xy} \bar{y}}{ \bar{x}^3} \right),
\end{equation}
with the sample variances $\hat{\sigma}_x^2 = \frac{1}{n-1} \sum_i \left( x_i - \bar{x} \right)^2$, $\hat{\sigma}_y^2 = \frac{1}{n-1} \sum_i \left( y_i - \bar{y} \right)^2$, and sample covariance $\hat{\sigma}_{xy} = \frac{1}{n-1} \sum_i \left( x_i - \bar{x} \right)\left( y_i - \bar{y} \right)$.
Under the i.i.d. assumption, the estimator $\widehat{\mathrm{SE_{\hat{r}}}}$ is consistent, \textit{i.e.}, $\widehat{\mathrm{SE_{\hat{r}}}}\to {\mathrm{SE_{\hat{r}}}}$ in probability for $n \to \infty$. The proof is presented in appendix~\ref{subsec:proof_variance}.

\paragraph{Calibration-based interval.}
% The estimation-based bounds involve local uncertainty that stems from statistical properties.
We next analyze the second source of uncertainty: volume bias caused by miscalibration.
Inspired by Prop.~\ref{prop:CP}, we propose a fine-grained calibration-based confidence interval that treats the uncertainty of the target (A) and RoI (B) regions separately, yielding asymmetric half-widths $\epsilon_{l,\delta}, \epsilon_{u,\delta}$ for the lower and upper bounds.
Unlike vanilla Conformalized Quantile Regression, where the analysis starts from the final $\hat{r}$, we use quantiles of $V_\mathrm{A}$ and $V_\mathrm{B}$ to obtain the calibration-based confidence interval of the RBB; see Prop.~\ref{prop:bound_miscal} (appendix~\ref{subsec:proof_volume}). Combining this with Prop.~\ref{prop:bound_local}, we propose \care~(V-Bias), which requires minimal assumptions and, unlike ACQR, needs no dedicated uncertainty score.
%%%%%%%%%%%%%%%%%%%%%%%%
As described in Prop.~\ref{prop:relation_l1}, V-Bias is upper bounded by the corresponding calibration error, \textit{i.e.}, $\lvert \text{V-Bias}\left(g_A\right) \rvert \leq \operatorname{CE} \left(g_A\right)$, $\lvert \text{V-Bias}\left(g_B\right) \rvert \leq \operatorname{CE} \left(g_B\right)$.
This motivates a more conservative interval, named \care~(ECE).


\begin{proposition}[Overall Confidence Interval]
\label{prop:bound_all} 
Assume we have a ratio estimator $\hat{r} = \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)}$ 
for pixel measurements $\{z_{i,I}\}_{i=1}^n$ of an instance $I$ based on neural network outputs $g(z_{i,I})=(g_A(z_{i,I}),g_B(z_{i,I}))$.
Let $y_A$ and $y_B$ be the instance-wise target random variables used to form the target ratio $r=\frac{\mathbb{E} \left[ y_A \mid I \right]}{\mathbb{E} \left[ y_B \mid I \right]} $.
Then, it holds with at least $1 - \alpha - \delta$ probability that
\begin{align}
r \in \left[ \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} - \epsilon_{l,\delta} - \beta_{r,\alpha}, \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} + \epsilon_{u,\delta} + \beta_{r,\alpha} \right],
\end{align}
where $\beta_{r,\alpha}$ is defined as in Prop.~\ref{prop:bound_local} and $\epsilon_{l,\delta}, \epsilon_{u,\delta}$ as in Prop.~\ref{prop:bound_miscal} (appendix \ref{subsec:proof_volume}).
\end{proposition}
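One way to see why the two levels add, sketched here as an illustration rather than as the formal proof, is a union bound: if the estimation-based interval fails with probability at most $\alpha$ and the calibration-based interval fails with probability at most $\delta$, then
\begin{align*}
\mathbb{P} \left( \text{at least one interval fails} \right) \le \alpha + \delta ,
\end{align*}
so both hold simultaneously with probability at least $1 - \alpha - \delta$. As an example with hypothetical values, take $\alpha = \delta = 0.16$ (a target level of $0.68$), $\hat{r} = 0.30$, $\beta_{r,\alpha} = 0.025$, $\epsilon_{l,\delta} = 0.03$ and $\epsilon_{u,\delta} = 0.05$. The overall interval is then
\begin{align*}
\left[ 0.30 - 0.03 - 0.025,\; 0.30 + 0.05 + 0.025 \right] = \left[ 0.245, 0.375 \right].
\end{align*}
Since this interval contains the illustrative threshold of $0.25$ from Fig.~\ref{fig:fig2} (b), such a case would be flagged for expert review.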
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% The proof is located in the appendix \ref{sec:proofs}.
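The interval width $w = B_u-B_l$, where $B_l$ and $B_u$ denote the lower and upper endpoints of the overall interval, measures the uncertainty level; consequently, a wide interval that spans a decision threshold raises an alarm for manual examination.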
% In implementation, we alternate through various $\alpha$ and $\delta$ for a fixed $\alpha+\delta$ with grid search to observe the impact on the interval width.
% This way, we can choose the smallest interval under a desired coverage rate.
We perform a grid search over $\alpha$ and $\delta$, keeping $\alpha + \delta$ constant, and select the configuration that yields the narrowest intervals while satisfying the target coverage rate.
In experiments (Sec.~\ref{sec:exp}), we show empirically that \care~(V-Bias) achieves robust coverage and adapts its width to different uncertainty levels. In addition, \care~(ECE) exhibits tighter bounds than ACQR at comparable coverage, without an extra uncertainty assumption.



\section{Experiments}
\label{sec:exp}
\subsection{Setup}
\label{subsec:exp_setup}

\paragraph{Datasets and models.}
We evaluate our method on two brain tumor segmentation datasets: MSD-Task01 \citep{antonelli2022medical} and BraTS21 \citep{baid2021rsna}, both of which provide four segmentation labels (edema, necrosis, enhancing tumor, and background). The necrosis-to-tumor ratio (NTR) is defined as $\frac{V_\mathrm{N}}{V_\mathrm{T}}$, \textit{i.e.} the ratio between the necrotic volume $V_\mathrm{N}$ and the whole tumor volume $V_\mathrm{T}$ (edema, necrosis, and enhancing regions). We additionally include KiTS23 \citep{myronenko2023automated}, a CT dataset of 489 kidney volumes, where the tumor-to-kidney ratio (TKR) is defined as 
$\frac{V_{\text{T}}}{V_{\text{Kidney}}}$. 
% $\frac{V_{\text{tumor}}}{V_{\text{whole kidney}}}$. 
To predict these biomarkers from segmentation outputs, we train nnUNet \citep{isensee2021nnu}, nnFormer \citep{zhou2021nnformer}, and UNETR++ \citep{zhou2021nnformer} using a nested five-fold cross-validation.
The predicted ratio $\hat{r}$ and the labeled ratio $r_\text{gt}$ are computed as in Def.~\ref{def:NN_ratio}.

\paragraph{Uncertainty Quantification  baselines.} 
To control the confidence level at $C = 0.68$, we use the corresponding quantile for \care, CQR, ACQR, and sampling-based methods. We also implement two Bayesian methods: ensemble and dropout. Because their inference is expensive, we report their $1\sigma$ confidence intervals (Tab.~\ref{tab:msd}). More implementation details are given in the appendix (Sec.~\ref{subsec:impl}).


\paragraph{Metrics.} Recalling Def.~\ref{def:ci}, we assess the soundness of a confidence interval by two criteria:
(1) \underline{\textit{Coverage rate}}:
The proportion of samples whose ground-truth values fall within the estimated confidence intervals (Def.~\ref{def:coverage}).
A sound confidence interval should achieve the prescribed confidence level in terms of coverage rate, while remaining as tight as possible.
(2) \underline{\textit{Adaptiveness}}:
The interval width should reflect the difficulty of prediction.
For ratio-based biomarkers, the interval width is expected to increase as the tumor size decreases,
since small tumors are generally harder to segment and tend to exhibit larger prediction errors.

Notably, there exists an intrinsic trade-off between \emph{coverage rate} and \emph{tightness}:
a naively loose and non-adaptive interval can trivially satisfy the coverage requirement,
but becomes uninformative for practical use.
Therefore, we evaluate UQ methods using both metrics jointly: 
We first identify methods that satisfy the desired coverage level (Tab.~\ref{tab:msd}),
and then compare their adaptiveness and tightness among the qualified methods (Fig.~\ref{fig:adaptive}).





\begin{table}[!h]
\centering
\caption{Comparison of the coverage guarantee on the MSD-Task01 dataset ($C=0.68$). We report the overall coverage rate (\%) on the test set ($_{\pm}\text{:~error bar}$). 
\care~always satisfies the desired confidence level without being overconservative. 
}
\resizebox{0.85\textwidth}{!}{
\begin{tabular}{lcccc}
\toprule
 Coverage (\%) & \textbf{nnUNet$_\text{2d}$}  & \textbf{nnUNet$_\text{3d}$} & \textbf{nnFormer} & \textbf{UNETR++} \\
\midrule
\textbf{Ensemble ($1\sigma$)} &  6.21$_{\pm 0.36}$ &   7.54$_{\pm 0.78}$ & 6.72$_{\pm 0.56}$ &  8.24$_{\pm 0.74}$ \\
\textbf{Dropout ($1\sigma$)} & 5.78$_{\pm 0.43}$  &  7.12$_{\pm 0.66}$  & 6.23$_{\pm 0.71}$ &  8.01$_{\pm 0.86}$ \\
\textbf{CQR} & 72.11$_{\pm 1.90}$ & 67.23$_{\pm 3.88}$ & 67.92$_{\pm 1.59}$ & 65.76$_{\pm 2.11}$ \\
% \rowcolor{lightgreen}
\textbf{ACQR} & 94.22$_{\pm 1.89}$ & 94.22$_{\pm 2.88}$ & 91.78$_{\pm 1.39}$ & 93.15$_{\pm 1.91}$ \\
% \rowcolor{lightgreen}
\textbf{\care~(ECE)} & 94.22$_{\pm 0.99}$ & 93.61$_{\pm 0.71}$ & 87.94$_{\pm 0.97}$ & 89.58$_{\pm 1.02}$ \\
\textbf{\care~(V-Bias)} & 93.61$_{\pm 1.14}$ & 86.60$_{\pm 1.49}$ & 81.92$_{\pm 1.31}$ & 76.43$_{\pm 2.21}$ \\
\bottomrule
\end{tabular}
}
\label{tab:msd}
\end{table}


\begin{figure}[!h]
\centering
\begin{minipage}[!h]{\linewidth}
    \centering
\includegraphics[width=\linewidth]{figure/adaptive/hist.pdf}
    \textbf{(a) Interval distributions on MSD-Task01 dataset.}
    \vspace{0.3cm}
\end{minipage}
%%%%%%%%%%%%%%%%%%%
\begin{minipage}[!h]{\linewidth}
    \centering
    \includegraphics[width=\linewidth]{figure/adaptive/size.pdf}    
    \textbf{(b) Interval widths stratified by tumor-size on 3 datasets.}
\end{minipage} 
    \caption{Comparison of adaptiveness on nnUNet$_\mathrm{3d}$ ($C=0.68$).
    (a) The frequency histogram of NTR intervals in the test set. ACQR's intervals frequently lie around the middle of the range, while \care~generally has tighter bounds.
    (b) The average interval width in three groups categorized by tumor size. Intuitively, the interval width should follow the trend of MSE$_\mathrm{r}$.
    Compared with the indistinguishable CQR and the overconservative ACQR, \care~adapts appropriately: its intervals are wider for small tumors (hard samples) and tighter for large ones (easy samples).}

\label{fig:adaptive}
\end{figure}

%




\subsection{Results}
\label{subsec:results}


\paragraph{Coverage guarantee.} 
As described in Sec.~\ref{sec:intro}, a conservative confidence interval achieves a coverage probability higher than the nominal confidence level, \textit{i.e.}, more than 68\% coverage when targeting a 68\% confidence level. We report the coverage rate (\%) of different UQ methods at the 0.68 confidence level in Table~\ref{tab:msd}, which measures \textit{the proportion of samples whose true values fall within the confidence intervals}. 
Empirically, our intervals satisfy the prescribed confidence level of 0.68 more reliably than those of sampling-based methods and CQR. Notably, the Bayesian methods (dropout and ensemble) show poor coverage due to the lack of an appropriate prior. 
Given the suboptimal performance of sampling-based methods, the following comparison focuses on the two methods with a coverage guarantee: ACQR and \care.
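For concreteness, the reported coverage rate can be written as an empirical frequency. The notation below is introduced here for illustration only: $\mathcal{D}_\mathrm{test}$ denotes the test set, $r_i$ the true ratio of sample $i$, and $[B_{l,i}, B_{u,i}]$ its confidence interval.
\[
\mathrm{Coverage} = \frac{1}{\lvert \mathcal{D}_\mathrm{test} \rvert} \sum_{i \in \mathcal{D}_\mathrm{test}} \mathbf{1}\left[ B_{l,i} \leq r_i \leq B_{u,i} \right].
\]
Under this reading, a method meets the nominal level $C$ when $\mathrm{Coverage} \geq C$, and it is conservative when the coverage lies well above $C$.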



\paragraph{Adaptiveness.}
Beyond achieving the guaranteed coverage rate, the confidence interval should be sample-adaptive to identify unreliable predictions effectively. We demonstrate this capability by examining the ``dataset-level interval'' distribution of MSD-Task01 in Fig.~\ref{fig:adaptive} (a). As observed, most ACQR intervals lie around 0.4, showing overall conservative bounds. In contrast, our method produces intervals that adapt to the tumor size (see the per-sample visualizations of intervals in the appendix, Fig.~\ref{fig:all_volumes}).
Furthermore, the uncertainty should correlate appropriately with segmentation difficulty. For instance, small tumors are hard to detect and segment for their small size, low contrast and susceptibility to noise. 
Empirically, hard samples with small sizes or blurry boundaries tend to yield erroneous predictions (large mean squared error), necessitating wider intervals to ensure coverage. 
To validate this adaptive behavior, we present a fine-grained analysis of MSE$_\text{r}$ (error measure) and interval width (uncertainty measure) in Fig.~\ref{fig:adaptive} (b), including NTR in MSD-Task01, NTR in BraTS21 and TKR in KiTS23.
We stratify tumors into small (S), medium (M), and large (L) categories based on the $\frac{1}{3}$ and $\frac{2}{3}$ quantiles of tumor sizes in the test set. As illustrated, our interval widths track the segmentation difficulty: smaller, more challenging tumors receive wider intervals, while larger, easier-to-segment tumors receive narrower intervals. In comparison, CQR is unable to distinguish different uncertainty levels, which prevents it from identifying high-risk predictions. Although both ACQR and \care~shrink their intervals for larger tumors, the extremely wide ACQR intervals for small tumors reduce sensitivity to tumor-specific variations.








\subsection{Further Study}
\label{subsed: further}


\paragraph{Tunability and robustness.}
To demonstrate tunability and robustness across confidence levels, we report NTR coverage rates at varying confidence levels in Fig.~\ref{fig:valid_theory} (a).
The coverage rate is expected to increase with the confidence level. However, CQR struggles to reach the desired confidence, and ACQR tends to be overconservative, as does the upper-bound variant \care~(ECE).

\paragraph{Temperature effects.}
Next, we report the interval width under different temperature parameters in Fig.~\ref{fig:valid_theory} (b) to observe the effect of post-hoc calibration on confidence measures and \care.  
The ECE of necrosis and tumor (ECE$_\text{N, T}$) reflects the miscalibration degree and the average interval width of \care~(ECE) works as the uncertainty measure. For illustration, we scale up ECE by $100$. As observed, both ECE and our interval width decrease as the temperature increases. This indicates that \care~becomes tighter for a well-calibrated model, and vice versa.
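As a reminder of the standard formulation of temperature scaling \citep{guo2017calibration}, the logits are divided by a scalar $T>0$ before the softmax. The symbols $z_k$ (logit of class $k$) and $\hat{p}_k$ (calibrated probability) are notation assumed here for illustration:
\[
\hat{p}_k = \frac{\exp\left( z_k / T \right)}{\sum_{j} \exp\left( z_j / T \right)}.
\]
Setting $T=1$ recovers the original prediction, while $T>1$ softens the probabilities and $T<1$ sharpens them.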

\paragraph{Uncertainty decomposition.}
As described in Sec.~\ref{sec:two_bounds}, we decompose uncertainty into miscalibration and intrinsic bias of ratio estimation.
We analyze their contributions empirically through an ablation on interval widths. Specifically, we calculate estimation-based ($I_\text{Est}$), V-Bias-based ($I_\text{V-Bias}$) and ECE-based ($I_\text{ECE}$) confidence intervals respectively. The results in Fig.~\ref{fig:valid_theory} (c) show that the miscalibration-based intervals, $I_\text{V-Bias}$ and $I_\text{ECE}$, are much wider than $I_\text{Est}$, indicating that model miscalibration is the primary uncertainty source in ratio estimation.


\paragraph{Uncertainty measure $u(x)$ in ACQR.}
The coverage and adaptiveness of ACQR rely heavily on a dedicated $u(x)$, as discussed in appendix~\ref{subsec:u_x}. In comparison, \care~provides a more straightforward and robust solution, through a clean and principled construction.


\begin{figure}[!h]
\centering
\begin{minipage}[h]{0.33\linewidth}
    \centering
\includegraphics[width=\linewidth]{figure/diff_C/ntr.png}
    \textbf{(a) Confidence levels.}
\end{minipage}
%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{minipage}[h]{0.33\linewidth}
    \centering
\includegraphics[width=\linewidth]{figure/msd/diff_T.png}
    \textbf{(b) Temperature effects.}
\end{minipage}
%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{minipage}[h]{0.32\linewidth}
    \centering
\includegraphics[width=\linewidth]{figure/msd/3_compose.png}
    \textbf{(c) Uncertainty ablation.}
\end{minipage} 
%%%%%%%%%%%%%%%%%%%%%%%%%%
    \caption{Further study on MSD-Task01 and nnUNet$_\text{3d}$ ($C=0.68$).
    (a) \care~satisfies the desired confidence levels consistently. (b) When the temperature moves towards better calibration (ECE $\downarrow$), our interval becomes narrower (Interval $\downarrow$). (c) Miscalibration is the main contributor to the overall uncertainty, since the ECE-only interval $I_\mathrm{ECE}$ takes the dominant portion of the overall interval $I_\mathrm{O}$.}
\label{fig:valid_theory}
\end{figure}



\paragraph{The size of $D_\mathrm{val}$.} 
In our experiments, the validation set of MSD-Task01 has 38 samples, which is already small. In Tab.~\ref{tab:re_val}, we further subsample $D_\mathrm{val}$ at random to even smaller sizes. \care~maintains the confidence level and adaptive intervals even with 10 samples. Moreover, the same-distribution assumption can hardly be guaranteed for a small $D_\mathrm{val}$. Therefore, Tab.~\ref{tab:re_val} indicates robustness towards smaller validation sets and domain shift.

\paragraph{Grid search.} As shown in Tab.~\ref{tab:grid}, our grid search over the combinations of $(\alpha, \delta)$ yields the minimal interval width under the coverage guarantee ($C=0.68$). Theoretically, the coverage rate is guaranteed to be at least the desired $C$ for any choice of $(\alpha,\delta)$ satisfying $\alpha+\delta=1-C$. Nevertheless, to provide an informative alarm in practice, we conduct a grid search to find the narrowest confidence intervals.
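The selection can be summarized as a constrained minimization. This is only a sketch: $\mathcal{G}$ (the set of candidate pairs) and $\bar{w}(\alpha,\delta)$ (the average interval width obtained with a given pair) are notation assumed here and are not defined in the main text.
\[
(\alpha^\star, \delta^\star) = \operatorname*{arg\,min}_{(\alpha,\delta) \in \mathcal{G},\; \alpha+\delta = 1-C} \bar{w}(\alpha,\delta).
\]
For $C=0.68$ the budget is $\alpha+\delta=0.32$, which both fixed pairs in Tab.~\ref{tab:grid}, $(0.02, 0.30)$ and $(0.30, 0.02)$, satisfy.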


\begin{table}[!h]
    \centering
    \caption{Sizes of the validation set on MSD-Task01 and nnUNet$_\mathrm{3d}$ ($C=0.68$).}
    \setlength{\tabcolsep}{10pt}
    \begin{tabular}{lcc}
    \hline
    \#Samples & Coverage (\%) & Interval (S/M/L)\\
    \hline
    10 & 85.78$_{\pm 1.68}$ & 0.51$_{\pm 0.09}$/0.26$_{\pm 0.10}$/0.14$_{\pm 0.07}$ \\
    20 & 86.10$_{\pm 1.39}$ & 0.56$_{\pm 0.10}$/0.28$_{\pm 0.12}$/0.13$_{\pm 0.08}$ \\
    30 & 85.92$_{\pm 1.53}$ & 0.42$_{\pm 0.09}$/0.20$_{\pm 0.09}$/0.14$_{\pm 0.08}$ \\
    \hline
    \end{tabular}
    \label{tab:re_val}
\end{table}


%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table}[!h]
    \centering
    \caption{Grid search on MSD-Task01 and nnUNet$_\mathrm{3d}$ ($C=0.68$).}
    \setlength{\tabcolsep}{10pt}
    \begin{tabular}{lcc}
    \hline
    $(\alpha, \delta)$ & Coverage (\%) & Interval (S/M/L)\\
    \hline
    (0.02, 0.30) & $91.75_{\pm 1.96}$ & 0.50$_{\pm 0.11}$/0.26$_{\pm 0.09}$/0.16$_{\pm 0.07}$ \\
    (0.30, 0.02) & 100.00$_{\pm 0.00}$ & 0.98$_{\pm 0.07}$/0.76$_{\pm 0.08}$/0.41$_{\pm 0.10}$ \\
    Grid & 86.60$_{\pm 1.49}$ & 0.49$_{\pm 0.09}$/0.21$_{\pm 0.07}$/0.12$_{\pm 0.05}$
 \\
    \hline
    \end{tabular}
    \label{tab:grid}
\end{table}

\paragraph{Segmentation metrics.} The Dice score and IoU of the tumor are reported in Tab.~\ref{tab:re_seg}. As tumors shrink, ratio estimation and segmentation performance decline at different rates, since the ratio estimator depends on two variables (tumor and necrosis). The overall degradation trend toward smaller tumors in segmentation (Dice and IoU in Tab.~\ref{tab:re_seg}) and ratio estimation (MSE in Fig.~\ref{fig:adaptive} (b)) indicates a relationship between tumor size and prediction difficulty. Therefore, the adaptiveness of our interval widths is justified.

\paragraph{Inference time.} Consistent with the main paper, we measure the inference time on a single A100 GPU. Tab.~\ref{tab:time} shows the extra inference time of \care, which is negligible relative to the segmentation time.


\begin{center}
\begin{minipage}{0.52\textwidth}
    \centering
    %% MSD-Task01
    \captionof{table}{Segmentation results.}
    \setlength{\tabcolsep}{5pt}
    \begin{tabular}{lcc}
    \hline
    Tumor$_\mathrm{size}$ & Dice & IoU\\
    \hline
    S & 0.65 & 0.53 \\
    M & 0.76 & 0.65 \\
    L & 0.77 & 0.65 \\
    \hline
    \end{tabular}
    \label{tab:re_seg}
\end{minipage}
%%%%%%%%%%%%%%%%%%%%
\begin{minipage}{0.45\textwidth}
    \centering
    \captionof{table}{Inference time.}
    \setlength{\tabcolsep}{5pt}
    \begin{tabular}{lcc}
    \hline
    Dataset & Seg. (ms) & \care~(ms)\\
    \hline
    MSD-Task01 & 1680.69$_{\pm 1.03}$ & +5.21$_{\pm 0.12}$ \\
    BraTS21 & 1681.69$_{\pm 1.10}$ & +5.25$_{\pm 0.11}$ \\
    \hline
    \end{tabular}
    \label{tab:time}
\end{minipage}
%%%%%%%%%%%%%%%%%%%%
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Related Work}
\label{sec:related_work}

\textbf{Uncertainty quantification}
provides statistical methods to estimate prediction uncertainty.
\underline{\textit{Adaptive Conformalized Quantile Regression (ACQR)}} \citep{vovk1999machine,vovk2005algorithmic} constructs prediction intervals that guarantee valid coverage under finite samples, without distributional assumptions beyond exchangeability. Its key strength is the distribution-free nature and finite-sample validity, providing strong theoretical guarantees regardless of the base predictive model. 
\underline{\textit{Resampling methods}} 
 are non-parametric techniques for estimating the sampling distribution of a statistic, applicable when the underlying distribution is unknown or difficult to derive.
Specifically, \underline{\textit{Bootstrapping}} \citep{mooney1993bootstrapping,freedman1981bootstrapping} repeatedly samples $N$ data points with replacement from the original data, whereas \underline{\textit{subsampling}} \citep{politis1994large} takes a subset of the original data without replacement, repeating the process multiple times to construct an empirical distribution of the statistic.
% By generating multiple resampled datasets, it allows for the estimation of confidence intervals and standard errors, providing a practical way to quantify uncertainty. 
\underline{\textit{Bayesian methods}} achieve robust segmentation by averaging multiple predictions, using techniques like deep ensemble \citep{lakshminarayanan2017simple} and Monte Carlo dropout \citep{srivastava2014dropout}. 

\textbf{Calibration error} estimation has attracted extensive research attention \citep{kull2015novel, vaicenavicius2019evaluating, kumar2019verified, zhang2020mix, popordanoska2022consistent, gruber2022better}. 
In medical segmentation, classwise and canonical calibration error are used to evaluate per-structure and overall calibration levels. 
Derived from individual channel masks, the classwise CE in multi-class segmentation simplifies to binary CE for each channel.
In addition, \cite{popordanoska2021relationship} prove that the absolute value of volume bias (V-Bias) is upper-bounded by CE. 
Many calibration methods like temperature scaling \citep{guo2017calibration} and isotonic regression \citep{zadrozny2002transforming} have been proposed to improve the calibration of classification scores. However, no previous work analyzes how miscalibration affects downstream ratio-based estimates.

\section{Conclusion}
\label{sec:conclusion}
We propose \care, a confidence-aware framework for estimating ratio-based biomarkers from segmentation network outputs. Our method addresses a common limitation of prior works that focus solely on point estimates without confidence guarantees.
We disentangle two key sources of uncertainty, \textit{i.e.}\ network prediction error and statistical bias.
Our empirical findings highlight that miscalibration is a dominant contributor to uncertainty.
Our framework offers several practical advantages: it operates as a model-agnostic plugin module, provides sample-level adaptive uncertainty estimates in a single forward pass without requiring multiple sampling, and allows users to flexibly adjust confidence levels.
In summary, this work represents an important step toward trustworthy deployment of deep learning in clinical settings by providing practitioners with both accurate biomarker estimates and reliable confidence bounds. 

Despite the practical advantages, our work has several limitations. First, we assume that the validation and test sets are drawn from the same distribution. Although this assumption is standard in supervised learning, it may not hold under domain shifts.
In practice, domain shifts arise due to differences in scanners, acquisition protocols, or patient populations. As a result, our confidence interval may not remain valid in these scenarios. 
Addressing this challenge with label-free calibration error estimators (e.g.~\citet{wang2020transferable,popordanoska2024lascal}) is a promising direction for future work. Second, the quality of the calibration of the underlying segmentation network has an impact on the tightness of the derived confidence intervals. Specifically, when the calibration error is large, the resulting confidence intervals may become overly conservative. Improving calibration in segmentation networks would directly translate into narrower, more informative confidence intervals within our approach.
Finally, while our framework shows good performance on public datasets, clinical validation is needed to assess its real-world impact on decision-making and patient outcomes.

Despite these limitations, we have presented the first confidence-aware method for estimating confidence intervals in imaging-based ratio biomarkers. 
Compared with existing baselines, our method yields intervals that are both tight and adaptive. We believe this provides a solid foundation for the next generation of AI systems capable of propagating uncertainty throughout the entire deep-learning-based biomarker estimation pipeline.



\midlacknowledgments{
This research received funding from the Flemish Government (AI Research Program) and the Research Foundation
Flanders (FWO) through project number G0G2921N.
Aleksei was supported by Sigrid Juselius Foundation and the Finnish Research Council (Profi6 336449
funding program), the strategic funding of the University of Oulu.
We acknowledge EuroHPC JU for awarding the project ID EHPC-AIF-2025SC02-042 access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium.
}


\bibliography{main}

\appendix

\begin{center}
    \Large\textbf{Appendix}
\end{center}




\renewcommand{\thefigure}{\Alph{figure}}  
\renewcommand{\thetable}{\Alph{table}} 
\setcounter{figure}{0}
\setcounter{table}{0}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent

\begin{figure}[ht!]
\centering
\includegraphics[width=0.95\textwidth]{figure/fig3/fig3_pipeline.pdf}
    \caption{{Our confidence interval considering estimation and miscalibration.} \textbf{\textcolor{red}{(a)}} shows Markov bounds from the estimator. \textbf{\textcolor{blue}{(b)}} illustrates the prediction offset $\epsilon_{l,u}$ due to miscalibration. \textbf{\textcolor[HTML]{ba68c8}{(c)}} is the overall confidence interval $r \in [B_l,B_u]$.}
    \label{fig:range_curve}
\end{figure}
A recap of our idea is shown in Fig.~\ref{fig:range_curve}.
In Appendix~\ref{sec:appendix_exp}, we further illustrate experimental details and present additional experimental results, relevant to our methodology and in support of the main paper.
In Appendix~\ref{sec:proofs}, we offer the proofs of the propositions in the main paper. Finally, we give related work in Appendix~\ref{sec:app_related_work}.



\section{Experiments}
\label{sec:appendix_exp}


\subsection{Experimental Details}
\label{subsec:impl}

\paragraph{Datasets.} 
MSD-Task01 \citep{antonelli2022medical} and BraTS21 \citep{baid2021rsna} include 484 and 1251 MRI volumes respectively, with four modalities (T1, T2, T1ce, FLAIR) and four annotations (edema, necrosis, enhancing tumor, background). KiTS23 \citep{myronenko2023automated} is a CT dataset of 489  kidney volumes, with four annotations (tumor, kidney, cyst, background).
A nested five-fold cross-validation is used for all datasets. In the outer loop, four folds are used for training and validation, and the remaining one fold for testing. Within the inner loop, 10\% of the training data is held out as a validation set $\mathcal{D}_\text{val}$ to estimate the quantile of V-Bias and ECE.


\paragraph{Segmentation models.} 
We conduct experiments using nnUNet \citep{isensee2021nnu}, nnFormer \citep{zhou2021nnformer} and UNETR++ \citep{zhou2021nnformer}.
All models are trained using cross-entropy (XE) \citep{bishop2006pattern} and soft Dice (SD) \citep{milletari2016v} loss, label-based supervision and softmax activation on a single A100 GPU. 


\paragraph{Implementation details.} 
For conformal prediction \citep{vovk1999machine,papadopoulos2002inductive}, we take the $\left(\frac{n+1}{n}\right)\cdot 0.68$ quantile of the absolute error residual $e_r$ on the validation set as the half-width (Prop.~\ref{prop:CP}), while for \care, we adopt dynamic V-Bias quantiles or ECE quantiles by conducting a grid search under the constraint $1-\alpha-\delta=0.68$ (Prop.~\ref{prop:bound_all}). 
For Bayesian methods, conducting numerous forward passes to estimate a ``tunable'' quantile is computationally impractical; thus, we report the results of three standard deviations ($3\sigma$). Ensemble intervals are obtained from $K$ models trained with different seeds, and dropout intervals come from $K$ forward passes ($K=20$). 
To implement sampling-based methods, we repeatedly sample pixels from an instance and calculate its ratio estimate 100 times, then adopt the $[0.16, 0.84]$ quantiles of the 100 repetitions as the interval at the 0.68 confidence level. Specifically, for a volume of $N$ pixels, we take $0.1N$ random pixels each time without replacement for subsampling \citep{politis1994large}, and sample $N$ pixels with replacement each time for bootstrapping \citep{mooney1993bootstrapping}.
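As a worked example of the finite-sample correction in the conformal quantile, assume $n=38$ validation samples, the size reported for MSD-Task01 (the exact $n$ may differ per fold):
\[
\frac{n+1}{n}\cdot 0.68 = \frac{39}{38}\cdot 0.68 \approx 0.698 .
\]
Under this assumption, the half-width is taken at roughly the $0.698$ empirical quantile of $e_r$ rather than at $0.68$, and the correction vanishes as $n$ grows.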

\begin{figure}[!bh]
    \centering
\begin{minipage}[b]{0.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{figure/msd_all/CP_uniform_est.png}
    \textbf{(a) CQR}
\end{minipage}
\begin{minipage}[b]{0.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{figure/msd_all/CP_adaptive_est_max_tumor_size.png}
    \textbf{(b) ACQR}
\end{minipage}
\begin{minipage}[b]{0.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{figure/msd_all/v_bias_est.png}
    \textbf{(c) \care~(V-Bias)}
\end{minipage}
\begin{minipage}[b]{0.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{figure/msd_all/bins15_est.png}
    \textbf{(d) \care~(ECE)}
\end{minipage}
    \caption{{Visualization of our confidence intervals on MSD and nnUNet$_\text{3d}$}. The x-axis represents all test samples sorted by predicted ratio $\hat{r}$, and the y-axis displays the valid range of ratio estimates. }
    \label{fig:all_volumes}
\end{figure}

\subsection{Coverage Guarantee and Adaptiveness}
\label{subsec:vis}
In the main paper, we only give the overall histogram of confidence intervals in Fig.~\ref{fig:adaptive}~(a).
To provide a more comprehensive, bird's-eye view of our method's behavior, we extend this analysis to all test samples in Fig.~\ref{fig:all_volumes}, where we plot $\hat{r}$ and the confidence intervals of four methods. For clarity, the sample indices are omitted. 
As shown in Fig.~\ref{fig:all_volumes}~(a), CQR has symmetric half-widths and nearly uniform interval widths, which prevents it from identifying unreliable predictions. ACQR (b) provides adaptive intervals but is overconservative.
In comparison, \care~shows adaptive intervals with the desired distinction between samples, which is particularly important in clinical settings to provide a reliable and informative reference.




\subsection{Ablation on $u(x)$}
\label{subsec:u_x}

In Remark \ref{rmk:u_size}, we assume a known maximum tumor size and set a scaling factor $\lambda=\frac{1}{2q_{s_r,\delta}}$ for the uncertainty measure $u(x)$, which extends the ACQR distribution across $[0,1]$. Here, we show two variants of $u(x)$: 
(i) without $\lambda$, a less well-designed $u(x)$; 
(ii) with voxel size $V$, \textit{i.e.} $u(x)=1-\frac{V_\mathrm{T}}{V}$, assuming unknown max. tumor size. 
Since the tumor size is much smaller than the whole voxel size, we adopt $\frac{1}{8}V$ as the denominator for the second variant. 
Following the format of Fig.~\ref{fig:adaptive}~(b) and Fig.~\ref{fig:all_volumes}, we report these results in Fig.~\ref{fig:u_x}. Compared with our implementation in the main paper, both variants are less adaptive while yielding narrow intervals. The voxel-size prior is easier to obtain than the maximum tumor size. However, this common but less informative prior ``dilutes'' the adaptiveness of ACQR, leading to nearly uniform intervals. Fig.~\ref{fig:u_x} further indicates the significant role of $u(x)$ in ACQR performance, which is also a drawback for its wider application.

\begin{figure}[!h]
\centering

\begin{minipage}[b]{0.75\linewidth}
    \centering
\includegraphics[width=\linewidth]{figure/u_x/interval.pdf}
    \textbf{(a) Interval visualization.}
\end{minipage}
\hspace{-0.3cm}
%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{minipage}[b]{0.24\linewidth}
    \centering
\includegraphics[width=0.85\linewidth]{figure/u_x/hist.pdf}
\\
    \textbf{(b) Histogram.}
\end{minipage}
%%%%%%%%%%%%%%%%%%%%%%%%%%
    \caption{{Ablation study on $u(x)$.} ``w/o $\lambda$'' means $u(x)=1-\frac{V_\mathrm{T}}{V_\mathrm{T,max}}$; ``w/ V'' means $u(x)=1-\frac{V_\mathrm{T}}{V}$. Both variants provide limited spans of confidence intervals, where all interval widths are below 0.2.}
\label{fig:u_x}
\end{figure}




\section{Proofs}
\label{sec:proofs} 

In this section, we first derive the scaling factor $\lambda$ (Sec.~\ref{subsec:proof_u}) mentioned in Remark~\ref{rmk:u_size}. Then we prove the Markov bounds (Sec.~\ref{subsec:proof_variance}) and the miscalibration bounds (Sec.~\ref{subsec:proof_volume}) mentioned in Sec.~\ref{sec:two_bounds}.
Finally, we derive a debiased estimator in Sec.~\ref{subsec:second_order}.


\subsection{Uncertainty Measures in Tumors}
\label{subsec:proof_u}

Recall that in ACQR, the confidence interval for a ratio-based biomarker \(r(x)\) is defined as
$\hat{r}(x) \pm u(x) q_{s_r,\delta}$,
where \(q_{s_r,\delta}\) is the $(\frac{n+1}{n})\delta$-quantile of the score \(s_r\). We choose the uncertainty measure $u(x) = \lambda \left(1 - \frac{V_\mathrm{T}}{V_\mathrm{T,max} + \epsilon}\right)$.

%Considering $\epsilon$ is set to be $10^{-7}$, 
The maximum possible width $I_\mathrm{max}$ occurs when $V_\mathrm{T} \to 0$: 
\begin{equation}
I_\mathrm{max} = 2\cdot u_\mathrm{max} \, q_{s_r,\delta} = 2 \cdot \lambda q_{s_r,\delta}.
\end{equation}
As the ratio is always in $[0,1]$: 
\begin{equation}
2 \cdot \lambda  q_{s_r,\delta} = 1\quad \Longrightarrow \quad \lambda = \frac{1}{2 q_{s_r,\delta}}.
\end{equation}



\subsection{Markov Bounds}
\label{subsec:proof_variance}

\cite{van2000mean} provides a confidence interval of the ratio estimator $\frac{\bar{y}}{\bar{x}}$ based on asymptotic normal assumptions and by using the variance $\sigma_r^2 \coloneqq \operatorname{Var} \left( \frac{\bar{y}}{\bar{x}} \right)$.
However, adopting their results 
assumes that all pixels are independently and identically distributed, i.e., $\left(x_1, y_1 \right), \dots, \left( x_n, y_n \right) \overset{\text{i.i.d.}}\sim \mathbb{P}_{xy}$.
In addition, they perform multiple approximation steps, and some approximations happen within the square operator.
How the estimator behaves facing a violation of these assumptions is unknown in practice.
In the following, we prove the alternative approach proposed in the main paper, which is based on Markov's inequality \citep{resnick2003probability}.
For conciseness, the ``$\approx$'' sign is avoided while we directly note the remainder terms for a rigorous analysis.

To avoid relying on any distribution assumptions, we construct a confidence interval via Markov's inequality for the estimator $\hat{r}=\frac{\bar{y}}{\bar{x}}$ and target $r=\frac{\mu_y}{\mu_x}$.
We have
\begin{equation}
    \mathbb{P} \left( \left\lvert \hat{r} - r\right\rvert \geq k \mathrm{\sqrt{SE_{\hat{r}}}}\right) 
    = \mathbb{P} \left( \left( \hat{r} - r\right)^2 \geq k^2\mathrm{SE_{\hat{r}}} \right)
    \leq 
    \frac{1}{k^2}
\end{equation}
with the squared error 
$\mathrm{SE_{\hat{r}}} \coloneqq \mathbb{E} \left[ \left(\hat{r}-r \right)^2 \right]$.
We emphasize that in general $\mathrm{\sqrt{SE_{\hat{r}}}} \neq \sigma_r$.


In the main paper, we denote $\alpha\coloneqq\frac{1}{k^2}$ as the non-coverage probability.
For instance, a $1-\alpha=75\%$ confidence interval corresponds to $\alpha=\frac{1}{k^2}=0.25$, \textit{i.e.}, $k=2$. 
The half-width of the confidence interval is then $2\mathrm{\sqrt{SE_{\hat{r}}}}$, \textit{i.e.}, twice the square root of the squared error.
This is more conservative than using the normal assumption, but requires no distribution assumption.
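To illustrate how the multiplier scales with the non-coverage budget, the relation $\alpha = \frac{1}{k^2}$ can be inverted. The two values of $\alpha$ below are chosen only as examples:
\[
k = \frac{1}{\sqrt{\alpha}}, \qquad \alpha = 0.30 \;\Rightarrow\; k \approx 1.83, \qquad \alpha = 0.02 \;\Rightarrow\; k \approx 7.07 .
\]
A small $\alpha$ therefore inflates the Markov half-width $k\,\mathrm{\sqrt{SE_{\hat{r}}}}$ considerably, which is the price of avoiding distributional assumptions.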

Now, we compute the squared error via Taylor expansion \citep{spivak2006calculus}.
First, note that
\begin{equation}
\begin{split}
    % \operatorname{MSE_r} = 
    \mathrm{{SE_{\hat{r}}}} = 
    \mathbb{E} \left[ \left(\frac{\bar{y}}{\bar{x}}-\frac{\mu_y}{\mu_x} \right)^2 \right]=
    \mathbb{E} \left[ \frac{\bar{y}^2}{\bar{x}^2} \right] - 2 \frac{\mu_y}{\mu_x} \mathbb{E} \left[ \frac{\bar{y}}{\bar{x}} \right]  + \frac{\mu_y^2}{\mu_x^2}.
\label{eq:mse_expansion}
\end{split}
\end{equation}

We perform a Taylor expansion of $\frac{\bar{y}^2}{\bar{x}^2}$ in $(\bar{x}, \bar{y})$ around $(\mu_x, \mu_y)$ to compute its expectation:

\begin{equation}
\begin{split}
    \frac{\bar{y}^2}{\bar{x}^2} & = \frac{\mu_y^2}{\mu_x^2} + 2 \left( \bar{y} - \mu_y \right) \frac{\mu_y}{\mu_x^2} - 2 \left( \bar{x} - \mu_x \right) \frac{\mu_y^2}{\mu_x^3} \\
    & \quad + \left( \bar{y} - \mu_y \right)^2 \frac{1}{\mu_x^2} + 3 \left( \bar{x} - \mu_x \right)^2 \frac{\mu_y^2}{\mu_x^4} - 4 \left(\bar{y} - \mu_y \right) \left( \bar{x} - \mu_x \right) \frac{\mu_y}{\mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \left( \bar{x} - \mu_x \right)^i \left( \bar{y} - \mu_y \right)^j \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y^2}{\mu_x^2} 
\end{split}
\end{equation}
from which follows
\begin{equation}
\begin{split}
    \mathbb{E} \left[ \frac{\bar{y}^2}{\bar{x}^2} \right] & = \frac{\mu_y^2}{\mu_x^2} + \frac{\operatorname{Var} \left( \bar{y} \right)}{\mu_x^2} + 3 \operatorname{Var} \left( \bar{x} \right) \frac{\mu_y^2}{\mu_x^4} - 4 \operatorname{Cov} \left( \bar{x}, \bar{y} \right)  \frac{\mu_y}{\mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \mathbb{E} \left[ \left( \bar{x} - \mu_x \right)^i \left( \bar{y} - \mu_y \right)^j \right] \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y^2}{\mu_x^2}.
\end{split}
\end{equation}
Assuming that $\left(x_1, y_1 \right), \dots, \left(x_n, y_n \right) \sim \mathbb{P}_{xy}$ are i.i.d. further simplifies these terms, as shown in the following.
Markov's inequality itself does not require this assumption, so a violation does not invalidate our approach.
Then, it holds that
\begin{equation}
    \operatorname{Var} \left( \bar{x} \right) = \frac{1}{n} \operatorname{Var} \left( x \right), \quad \operatorname{Var} \left( \bar{y} \right) = \frac{1}{n} \operatorname{Var} \left( y \right), \quad \operatorname{Cov} \left( \bar{x}, \bar{y} \right) = \frac{1}{n} \operatorname{Cov} \left( x, y \right).
\label{eq:expectation_2nd_moments}
\end{equation}
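For completeness, the first identity follows in one line from the independence of the samples, which makes the variance of the sum equal to the sum of the variances:
\[
\operatorname{Var} \left( \bar{x} \right) = \operatorname{Var} \left( \frac{1}{n} \sum_{a=1}^n x_a \right) = \frac{1}{n^2} \sum_{a=1}^n \operatorname{Var} \left( x_a \right) = \frac{1}{n} \operatorname{Var} \left( x \right).
\]
The identities for $\bar{y}$ and for the covariance follow by the same argument.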
Further, for all $a=1, \dots n$ let $z_{k,a} = x_a$ and $\mu_{z_k} = \mu_x$ if $1 \leq k\leq i$, and $z_{k,a} = y_a$ and $\mu_{z_k} = \mu_y$ if $i < k \leq m \coloneqq i+j$.
Then
\begin{equation}
\begin{split}
    & \mathbb{E} \left[ \left( \bar{x} - \mu_x \right)^i \left( \bar{y} - \mu_y \right)^j \right] \\
    & = \frac{1}{n^{i+j}} \mathbb{E} \left[ \left( \sum_{a=1}^n \left( x_a - \mu_x \right) \right)^i \left( \sum_{a=1}^n \left( y_a - \mu_y \right) \right)^j \right] \\
    & = \frac{1}{n^{m}} \mathbb{E} \left[ \prod_{k=1}^m \left( \sum_{a=1}^n \left( z_{k,a} - \mu_{z_k} \right) \right) \right] \\
    & = \frac{1}{n^{m}} \sum_{l=1}^m \sum_{a_l=1}^n \mathbb{E} \left[ \prod_{k=1}^m \left( z_{k,a_k} - \mu_{z_k} \right) \right] \\
\end{split}
\end{equation}
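For any index tuple $(a_1, \dots, a_m)$, it holds that $\mathbb{E} \left[ \prod_{k=1}^m \left( z_{k,a_k} - \mu_{z_k} \right) \right] = 0$ whenever some index value occurs only once in the tuple, since the corresponding zero-mean factor is independent of the others.
Hence every contributing tuple uses at most $\lfloor m/2 \rfloor$ distinct index values, which reduces the number of nonzero addends from $n^m$ to the order of $n^{\lfloor m/2 \rfloor}$: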
\begin{equation}
\begin{split}
    & \frac{1}{n^{m}} \underbrace{\sum_{l=1}^m \sum_{a_l=1}^n \mathbb{E} \left[ \prod_{k=1}^m \left( z_{k,a_k} - \mu_{z_k} \right) \right]}_{n^m \text{ addends}} \\
    & = \frac{1}{n^{m}} \underbrace{\sum_{l=1}^{\lfloor m/2 \rfloor} \sum_{a_l=1}^n \mathbb{E} \left[ \prod_{k=1}^m \left( z_{k,a_k} - \mu_{z_k} \right) \right]}_{n^{\lfloor m/2 \rfloor} \text{ addends}} \\
    & = \frac{1}{n^{\lceil m/2 \rceil}} \underbrace{\frac{1}{n^{\lfloor m/2 \rfloor}} \sum_{l=1}^{\lfloor m/2 \rfloor} \sum_{a_l=1}^n \mathbb{E} \left[ \prod_{k=1}^m \left( z_{k,a_k} - \mu_{z_k} \right) \right]}_{=: C_{ij}}. \\
\label{eq:expectation_higher_moments}
\end{split}
\end{equation}
Note that $C_{ij} \in \left[ -B_{m}, B_m \right]$
with $B_m \coloneqq \max_{\left\{ i,j=0,\dots m \mid i+j \leq m \right\}} \left\lvert \mathbb{E} \left[ \left( x - \mu_x \right)^i \left( y - \mu_y \right)^j \right] \right\rvert$, therefore, the convergence rate depends not only on the data size $n$ but also on how the moments grow with $m$.

Using Eqn.~\ref{eq:expectation_2nd_moments} and Eqn.~\ref{eq:expectation_higher_moments} gives
\begin{equation}
\begin{split}
    \mathbb{E} \left[ \frac{\bar{y}^2}{\bar{x}^2} \right] & = \frac{\mu_y^2}{\mu_x^2} + \frac{\operatorname{Var} \left( y \right)}{n\mu_x^2} + 3 \operatorname{Var} \left( x \right) \frac{\mu_y^2}{n \mu_x^4} - 4 \operatorname{Cov} \left( x, y \right)  \frac{\mu_y}{n \mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \frac{1}{n^{\lceil \left(i+j\right)/2 \rceil}} C_{ij} \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y^2}{\mu_x^2}.
\label{eq:expectation_ybar2/xbar2}
\end{split}
\end{equation}
% Under the assumption that $\left\lvert \mathbb{E} \left[ \left( x - \mu_x \right)^i \left( x - \mu_x \right)^j \right] \right\rvert$ does not grow too quickly for either $i \to \infty$ or/and $j \to \infty$

Similarly, we use a Taylor expansion of $\frac{\bar{y}}{\bar{x}}$ in $(\bar{x}, \bar{y})$ around $(\mu_x, \mu_y)$ to get
\begin{equation}
\begin{split}
    \frac{\bar{y}}{\bar{x}} & = \frac{\mu_y}{\mu_x} + \left( \bar{y} - \mu_y \right) \frac{1}{\mu_x} - \left( \bar{x} - \mu_x \right) \frac{\mu_y}{\mu_x^2} \\
    & \quad + 0 + \left( \bar{x} - \mu_x \right)^2 \frac{\mu_y}{\mu_x^3} - \left(\bar{y} - \mu_y \right) \left( \bar{x} - \mu_x \right) \frac{1}{\mu_x^2} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \left( \bar{x} - \mu_x \right)^i \left( \bar{y} - \mu_y \right)^j \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y}{\mu_x},
\end{split}
\end{equation}
which results in
\begin{equation}
\begin{split}
    \frac{\mu_y}{\mu_x} \mathbb{E} \left[ \frac{\bar{y}}{\bar{x}} \right] & = \frac{\mu_y^2}{\mu_x^2} + \operatorname{Var} \left( \bar{x} \right) \frac{\mu_y^2}{\mu_x^4} - \operatorname{Cov} \left(\bar{y}, \bar{x} \right) \frac{\mu_y}{\mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \mathbb{E} \left[ \left( \bar{x} - \mu_x \right)^i \left( \bar{y} - \mu_y \right)^j \right] \frac{\mu_y}{\mu_x} \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y}{\mu_x} \\
    & = \frac{\mu_y^2}{\mu_x^2} + \operatorname{Var} \left( x \right) \frac{\mu_y^2}{n \mu_x^4} - \operatorname{Cov} \left(x, y \right) \frac{\mu_y}{n \mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \frac{1}{n^{\lceil \left(i+j\right)/2 \rceil}} C_{ij} \frac{\mu_y}{\mu_x} \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y}{\mu_x}.
\label{eq:expectation_bary/barx}
\end{split}
\end{equation}
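
The step from the first to the second equality above relies on the moments of the sample means. As a brief sketch, assuming that the pairs $(x_k, y_k)$, $k = 1, \dots, n$, are i.i.d. (an assumption restated here only for illustration), bilinearity of the covariance gives
\begin{equation*}
    \operatorname{Cov} \left( \bar{y}, \bar{x} \right) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} \operatorname{Cov} \left( y_k, x_l \right) = \frac{1}{n^2} \sum_{k=1}^{n} \operatorname{Cov} \left( y_k, x_k \right) = \frac{\operatorname{Cov} \left( x, y \right)}{n},
\end{equation*}
since the cross terms with $k \neq l$ vanish by independence. Setting $y = x$ yields $\operatorname{Var} \left( \bar{x} \right) = \operatorname{Var} \left( x \right) / n$.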

Inserting Eqn.~\ref{eq:expectation_ybar2/xbar2} and Eqn.~\ref{eq:expectation_bary/barx} into Eqn.~\ref{eq:mse_expansion} results in
\begin{equation}
\begin{split}
    % \operatorname{MSE_r} 
    \mathrm{SE_{\hat{r}}}
    & = 
    2 \frac{\mu_y^2}{\mu_x^2} + \frac{\operatorname{Var} \left( y \right)}{n \mu_x^2} + 3 \operatorname{Var} \left( x \right) \frac{\mu_y^2}{n \mu_x^4} - 4 \operatorname{Cov} \left( x, y \right)  \frac{\mu_y}{n \mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \frac{1}{n^{\lceil \left(i+j\right)/2 \rceil}} C_{ij} \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y^2}{\mu_x^2} \\
    & \quad - 2 \Big( \frac{\mu_y^2}{\mu_x^2} + \operatorname{Var} \left( x \right) \frac{\mu_y^2}{n \mu_x^4} - \operatorname{Cov} \left(x, y \right) \frac{\mu_y}{n \mu_x^3} \\
    & \quad + \sum_{i,j \colon\; i+j\geq 3} \frac{1}{n^{\lceil \left(i+j\right)/2 \rceil}} C_{ij} \frac{\mu_y}{\mu_x} \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y}{\mu_x} \Big) \\
    & = \frac{1}{n} \left( \frac{\operatorname{Var} \left( y \right)}{\mu_x^2} + \operatorname{Var} \left( x \right) \frac{\mu_y^2}{ \mu_x^4} - 2 \operatorname{Cov} \left( x, y \right)  \frac{\mu_y}{ \mu_x^3} \right) \\
    & \quad + \underbrace{\sum_{i,j \colon\; i+j\geq 3} \frac{1}{n^{\lceil \left(i+j\right)/2 \rceil}} C_{ij} \left( \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y^2}{\mu_x^2} - \frac{2 \mu_y}{\mu_x} \frac{\partial^{i+j}}{\left( \partial \mu_x \right)^i \left( \partial \mu_y \right)^j} \frac{\mu_y}{\mu_x}\right)}_{ \in O \left( \frac{1}{n^2} \right)}.
\end{split}
\end{equation}
Consequently, we may estimate $\mathrm{SE_{\hat{r}}}$ via
% % \operatorname{MSE}}_r
\begin{equation}
    \widehat{
    \mathrm{SE}}_{\mathrm{\hat{r}}}
    \coloneqq \frac{1}{n} \left(\frac{\hat{\mu}_y^2 \hat{\sigma}^2_x}{\hat{\mu}_x^4} + \frac{\hat{\sigma}^2_y}{\hat{\mu}_x^2} - 2 \frac{\hat{\mu}_y \hat{\sigma}_{xy}}{\hat{\mu}_x^3} \right),
\end{equation}
which is consistent since the estimators $\hat{\mu}_y = \frac{1}{n} \sum_i y_i$, $\hat{\mu}_x = \frac{1}{n} \sum_i x_i$, $\hat{\sigma}_y^2 = \frac{1}{n-1} \sum_i \left( y_i - \hat{\mu}_y \right)^2$, $\hat{\sigma}_x^2 = \frac{1}{n-1} \sum_i \left( x_i - \hat{\mu}_x \right)^2$, and $\hat{\sigma}_{xy} = \frac{1}{n-1} \sum_i \left( x_i - \hat{\mu}_x \right)\left( y_i - \hat{\mu}_y \right)$ are consistent as well.





\subsection{Volume to Ratio Confidence Intervals}
\label{subsec:proof_volume}


\begin{proposition}[Calibration-based Confidence Interval]
\label{prop:bound_miscal} 
Consider a segmentation model $g(z) = (g_A (z), g_B (z))$ with the random variable $z$ representing pixel inputs of instance $I$, and targets $y_A$ and $y_B$.
On a validation (calibration) set $\mathcal{D}_\text{cal}$, define $q_{A, \delta/2}$ and $q_{B, \delta/2}$ as the $\frac{n+1}{n}(1-\frac{\delta}{2})$ quantile of the instance-wise volume bias or calibration errors of $g_A$ and $g_B$.
Then, it holds with at least $1-\delta$ probability that
\begin{align}
\frac{\mathbb{E} \left[ y_A \mid I \right]}{\mathbb{E} \left[ y_B \mid I \right]} \in \left[ \frac{\mathbb{E} \left[ g_A(z) \mid I \right]}{\mathbb{E} \left[ g_B(z) \mid I \right]} - \epsilon_{l,\delta}, \frac{\mathbb{E} \left[ g_A(z) \mid I \right]}{\mathbb{E} \left[ g_B(z) \mid I \right]} + \epsilon_{u,\delta} \right],
\label{eq:bound_ce}
\end{align}
where $\epsilon_{l,\delta} \coloneqq \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]}-\frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right] - q_{A, \delta/2}}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right] + q_{B, \delta/2}}$ and
$\epsilon_{u,\delta} \coloneqq\frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right] + q_{A, \delta/2}}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right] - q_{B, \delta/2}}-\frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]}$
are the widths of the lower and upper calibration bounds, respectively.
\end{proposition}
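
To illustrate how the two widths behave, the following is a small worked sketch. Write $m_A$ and $m_B$ for the expectations of $g_A(z)$ and $g_B(z)$ that appear in the definitions of $\epsilon_{l,\delta}$ and $\epsilon_{u,\delta}$, and $q_A$, $q_B$ for the corresponding quantiles (shorthand introduced here for illustration only), and assume that all four quantities are nonnegative with $m_B > q_B$. Bringing each difference to a common denominator gives
\begin{equation*}
    \epsilon_{l,\delta} = \frac{m_A q_B + m_B q_A}{m_B \left( m_B + q_B \right)}, \qquad
    \epsilon_{u,\delta} = \frac{m_A q_B + m_B q_A}{m_B \left( m_B - q_B \right)},
\end{equation*}
so that $\epsilon_{u,\delta} \geq \epsilon_{l,\delta}$ and the interval is asymmetric around $m_A / m_B$. For the hypothetical values $m_A = 0.2$, $m_B = 0.5$, $q_A = 0.02$, and $q_B = 0.05$, this yields $\epsilon_{l,\delta} \approx 0.073$ and $\epsilon_{u,\delta} \approx 0.089$, that is, the interval $[0.327, 0.489]$ around the ratio $0.4$.
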
In experiments, \care~(V-Bias) takes the quantile of |V-Bias| \citep{popordanoska2021relationship} as $q_{A, \delta/2}$ and $q_{B, \delta/2}$, while \care~(ECE) uses ECE \citep{guo2017calibration} quantiles.
To combine this interval with the interval around the empirical ratio, we use a union-bound argument that is analogous to multiple testing.
This way, both sources of uncertainty are covered by a single interval in practice.




Note that if $a \notin \left[b, c \right] \subseteq \mathbb{R}_{>0}$ for some $a > 0$, then $\frac{1}{a} \notin \left[\frac{1}{c}, \frac{1}{b} \right]$, since $x \mapsto \frac{1}{x}$ is strictly decreasing on $\mathbb{R}_{>0}$.
We also make use of the subadditivity of probability measures \citep{resnick2003probability} given by
\begin{equation}
    \mathbb{P} \left( \bigcup_i A_i \right) \leq \sum_i \mathbb{P} \left( A_i \right).
\end{equation}
This is also known as Boole's inequality. In the following, the random variable $z$ denotes the pixel inputs of image instance $I$.
As described in the main paper, $q_{A,\alpha}$ and $q_{B,\alpha}$ are empirically determined on a validation set as the $1-\alpha$ quantiles of the image-wise calibration errors $\operatorname{CE}_{A,I}$ and $\operatorname{CE}_{B,I}$ of $g_A$ and $g_B$.
Then, for $\alpha \in [0,1]$ it holds that
\begin{equation}
\begin{split}
    \alpha & = \frac{\alpha}{2} + \frac{\alpha}{2} \\
    & \geq \mathbb{P} \left( \operatorname{CE}_{A,I} \geq q_{A, \alpha/2} \right) + \mathbb{P} \left( \operatorname{CE}_{B,I} \geq q_{B, \alpha/2} \right) \\
    & \geq \mathbb{P} \left( \left\lvert \mathbb{E} \left[ Y_A \mid I \right] - \mathbb{E} \left[ g_A \left( z \right) \mid I \right] \right\rvert \geq q_{A, \alpha/2} \right) + \mathbb{P} \left( \left\lvert \mathbb{E} \left[ Y_B \mid I \right] - \mathbb{E} \left[ g_B \left( z \right) \mid I \right] \right\rvert \geq q_{B, \alpha/2} \right) \\
    & \geq \mathbb{P} \left( \left\lvert \mathbb{E} \left[ Y_A \mid I \right] - \mathbb{E} \left[ g_A \left( z \right) \mid I \right] \right\rvert \geq q_{A, \alpha/2} \lor \left\lvert \mathbb{E} \left[ Y_B \mid I \right] - \mathbb{E} \left[ g_B \left( z \right) \mid I \right] \right\rvert \geq q_{B, \alpha/2} \right) \\
    & = \mathbb{P} \Big( \mathbb{E} \left[ Y_A \mid I \right] \notin [\mathbb{E} \left[ g_A \left( z \right) \mid I \right] - q_{A, \alpha/2}, \mathbb{E} \left[ g_A \left( z \right) \mid I \right] + q_{A, \alpha/2}] \\
    & \quad \lor \mathbb{E} \left[ Y_B \mid I \right] \notin [\mathbb{E} \left[ g_B \left( z \right) \mid I \right] - q_{B, \alpha/2}, \mathbb{E} \left[ g_B \left( z \right) \mid I \right] + q_{B, \alpha/2}] \Big) \\
    & = \mathbb{P} \Big( \mathbb{E} \left[ Y_A \mid I \right] \notin [\mathbb{E} \left[ g_A \left( z \right) \mid I \right] - q_{A, \alpha/2}, \mathbb{E} \left[ g_A \left( z \right) \mid I \right] + q_{A, \alpha/2}] \\
    & \quad \lor \frac{1}{\mathbb{E} \left[ Y_B \mid I \right]} \notin \left[\frac{1}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right] + q_{B, \alpha/2}}, \frac{1}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right] - q_{B, \alpha/2}} \right] \Big) \\
    & \geq \mathbb{P} \left( \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \notin \left[\frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right] - q_{A, \alpha/2}}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right] + q_{B, \alpha/2}}, \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right] + q_{A, \alpha/2}}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right] - q_{B, \alpha/2}} \right] \right).
\end{split}
\end{equation}


It follows that, with probability at least $1 - \alpha$,
\begin{equation}
    \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \in 
    \left[
        \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right] - q_{A, \alpha/2}}
             {\mathbb{E} \left[ g_B \left( z \right) \mid I \right] + q_{B, \alpha/2}},\,
        \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right] + q_{A, \alpha/2}}
             {\mathbb{E} \left[ g_B \left( z \right) \mid I \right] - q_{B, \alpha/2}}
    \right].
\end{equation}
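
The last inequality in the chain above is the contrapositive of a simple interval argument, sketched here under the additional assumption (not stated explicitly above) that all interval endpoints are positive. Writing $[a_l, a_u]$ and $[b_l, b_u]$ as shorthand for the two intervals around $\mathbb{E} \left[ g_A \left( z \right) \mid I \right]$ and $\mathbb{E} \left[ g_B \left( z \right) \mid I \right]$, with $a_l \geq 0$ and $b_l > 0$, we have
\begin{equation*}
    a_l \leq \mathbb{E} \left[ Y_A \mid I \right] \leq a_u
    \;\land\;
    \frac{1}{b_u} \leq \frac{1}{\mathbb{E} \left[ Y_B \mid I \right]} \leq \frac{1}{b_l}
    \quad \Longrightarrow \quad
    \frac{a_l}{b_u} \leq \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \leq \frac{a_u}{b_l},
\end{equation*}
because inequalities between nonnegative quantities can be multiplied. Hence the ratio can leave $\left[ a_l / b_u, a_u / b_l \right]$ only if at least one of the two marginal events occurs, and the probability of the former is no larger than that of the latter.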


Applying the previous result at level $\delta$ in place of $\alpha$, so that the interval endpoints equal the model ratio minus $\epsilon_{l,\delta}$ and plus $\epsilon_{u,\delta}$, and combining it via Boole's inequality with the interval of half-width $\beta_{r,\alpha}$ around the empirical ratio, whose miscoverage probability is at most $\alpha$, it further holds that
\begin{equation}
\begin{split}
    & \delta + \alpha \\
    & \geq \mathbb{P} \Bigg( \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \notin 
        \left[ \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]} - \epsilon_{l,\delta}, 
               \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]} + \epsilon_{u,\delta} \right] \Bigg) \\
    & \quad + \mathbb{P} \Bigg( \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]} \notin 
        \left[ \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} - \beta_{r,\alpha}, 
               \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} + \beta_{r,\alpha} \right] \Bigg) \\
    & \geq \mathbb{P} \Bigg( \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \notin 
        \left[ \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]} - \epsilon_{l,\delta}, 
               \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]} + \epsilon_{u,\delta} \right] \\
    & \quad\quad \lor \frac{\mathbb{E} \left[ g_A \left( z \right) \mid I \right]}{\mathbb{E} \left[ g_B \left( z \right) \mid I \right]} \notin 
        \left[ \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} - \beta_{r,\alpha}, 
               \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} + \beta_{r,\alpha} \right] \Bigg) \\
    & \geq \mathbb{P} \Bigg( \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \notin 
        \left[ \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} - \epsilon_{l,\delta} - \beta_{r,\alpha}, 
               \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} + \epsilon_{u,\delta} + \beta_{r,\alpha} \right] \Bigg).
\end{split}
\end{equation}
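
The last inequality follows from chaining the two intervals. As a short sketch, write $\rho$ for the target ratio $\mathbb{E} \left[ Y_A \mid I \right] / \mathbb{E} \left[ Y_B \mid I \right]$, $m$ for the model ratio $\mathbb{E} \left[ g_A \left( z \right) \mid I \right] / \mathbb{E} \left[ g_B \left( z \right) \mid I \right]$, and $\hat{m}$ for the empirical ratio $\sum_i g_A \left( z_{i,I} \right) / \sum_i g_B \left( z_{i,I} \right)$ (shorthand introduced here only for illustration). If both $\rho \in \left[ m - \epsilon_{l,\delta}, m + \epsilon_{u,\delta} \right]$ and $m \in \left[ \hat{m} - \beta_{r,\alpha}, \hat{m} + \beta_{r,\alpha} \right]$ hold, then
\begin{equation*}
    \hat{m} - \beta_{r,\alpha} - \epsilon_{l,\delta} \leq m - \epsilon_{l,\delta} \leq \rho \leq m + \epsilon_{u,\delta} \leq \hat{m} + \beta_{r,\alpha} + \epsilon_{u,\delta},
\end{equation*}
so $\rho$ can leave the widened interval only if at least one of the two individual coverage events fails.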



It follows that, with probability at least $1-\alpha - \delta$,

\begin{equation}
    \frac{\mathbb{E} \left[ Y_A \mid I \right]}{\mathbb{E} \left[ Y_B \mid I \right]} \in 
    \left[ 
        \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} - \epsilon_{l,\delta} - \beta_{r,\alpha},\,
        \frac{\sum_i g_A \left( z_{i,I} \right)}{\sum_i g_B \left( z_{i,I} \right)} + \epsilon_{u,\delta} + \beta_{r,\alpha} 
    \right].
\end{equation}


\subsection{Debiased Ratio Estimation}
\label{subsec:second_order}

The naive ratio estimator is biased for finite sample sizes. Here we extend \cite{popordanoska2022consistent} to derive a debiased ratio estimator that corrects the bias terms of order $n^{-1}$ and $n^{-2}$. First, the naive estimator can be written as:
\begin{align}
    \hat{r} =\frac{\bar{y}}{\bar{x}} = \frac{\mu_y}{\mu_x} \left(\frac{\bar{y}}{\mu_y}\right)\left(\frac{\bar{x}}{\mu_x}\right)^{-1}=\frac{\mu_y}{\mu_x}\biggl(1 + \frac{\bar{y}-\mu_y}{\mu_y} \biggr)\biggl(1 + \frac{\bar{x}-\mu_x}{\mu_x} \biggr)^{-1}.
\end{align}
Then we expand $\biggl(1 + \frac{\bar{x}-\mu_x}{\mu_x} \biggr)^{-1}$ in a Taylor series:
\begin{align}
    \begin{split}
        \hat{r}&= \frac{\mu_y}{\mu_x}\biggl(1 + \frac{(\bar{y}-\mu_y)}{\mu_y} - \frac{(\bar{x}-\mu_x)}{\mu_x} - \frac{(\bar{x}-\mu_x)(\bar{y}-\mu_y)}{\mu_y \mu_x} + \frac{(\bar{x}-\mu_x)^2}{\mu_x^2} \\
        & \phantom{asdfa}+ \frac{(\bar{x}-\mu_x)^2 (\bar{y}-\mu_y)}{\mu_x^2 \mu_y} - \frac{(\bar{x}-\mu_x)^3}{\mu_x^3} - \frac{(\bar{x}-\mu_x)^3 (\bar{y}-\mu_y)}{\mu_x^3 \mu_y}  + \frac{(\bar{x}-\mu_x)^4}{\mu_x^4}  \biggr) + \mathcal{O}(n^{-2.5})
    \end{split}
\end{align}
The bias of $\hat{r}$, defined as $\mathbb{E}[\hat{r}] - r$ with $r = \mu_y / \mu_x$, is then:
\begin{align}
    \operatorname{Bias}_r
    &=\frac{\mu_y}{\mu_x}\Biggl(\frac{1}{n}\biggl(\frac{\operatorname{Var}(x)}{\mu_x^2} - \frac{\operatorname{Cov}(x, y)}{\mu_x \mu_y} \biggr) + \frac{1}{n^2}\biggl(\frac{(\operatorname{Cov}(x^2, y) -2\mu_x \operatorname{Cov}(x, y))}{\mu_x^2 \mu_y}\\
        &\phantom{asdfa}-\frac{(\operatorname{Cov}(x^2, x) - 2\mu_x \operatorname{Var}(x))}{\mu_x^3} - \frac{3\operatorname{Var}(x) \operatorname{Cov}(x, y)}{\mu_x^3 \mu_y}+\frac{3 \operatorname{Var}(x)^2}{\mu_x^4}\biggr)\Biggr)
\end{align}
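The $\frac{1}{n}$ and $\frac{1}{n^2}$ coefficients above result from taking the expectation of the expansion term by term. As a brief sketch, assuming i.i.d. pairs $(x_k, y_k)$ with finite higher-order moments (an assumption made here for illustration), the sample-mean moments that enter are
\begin{align*}
    \mathbb{E}\bigl[(\bar{x}-\mu_x)^2\bigr] &= \frac{\operatorname{Var}(x)}{n}, \\
    \mathbb{E}\bigl[(\bar{x}-\mu_x)(\bar{y}-\mu_y)\bigr] &= \frac{\operatorname{Cov}(x, y)}{n}, \\
    \mathbb{E}\bigl[(\bar{x}-\mu_x)^3\bigr] &= \frac{\mathbb{E}\bigl[(x-\mu_x)^3\bigr]}{n^2}, \\
    \mathbb{E}\bigl[(\bar{x}-\mu_x)^2(\bar{y}-\mu_y)\bigr] &= \frac{\mathbb{E}\bigl[(x-\mu_x)^2(y-\mu_y)\bigr]}{n^2}, \\
    \mathbb{E}\bigl[(\bar{x}-\mu_x)^4\bigr] &= \frac{3\operatorname{Var}(x)^2}{n^2} + \mathcal{O}(n^{-3}), \\
    \mathbb{E}\bigl[(\bar{x}-\mu_x)^3(\bar{y}-\mu_y)\bigr] &= \frac{3\operatorname{Var}(x)\operatorname{Cov}(x, y)}{n^2} + \mathcal{O}(n^{-3}),
\end{align*}
together with the identities $\mathbb{E}\bigl[(x-\mu_x)^3\bigr] = \operatorname{Cov}(x^2, x) - 2\mu_x \operatorname{Var}(x)$ and $\mathbb{E}\bigl[(x-\mu_x)^2(y-\mu_y)\bigr] = \operatorname{Cov}(x^2, y) - 2\mu_x \operatorname{Cov}(x, y)$. Under these assumptions, the stated bias holds up to a remainder of order $\mathcal{O}(n^{-3})$.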
A second-order debiased estimator is then defined by $r_{corr,2}:=\hat{r}-\operatorname{Bias}_r$:
\begin{align}
    {r}_{corr,2} &= \hat{r} - \frac{{\mu_y}}{{\mu_x}}\Biggl(\frac{1}{n}\biggl(\frac{\operatorname{Var}(x)}{\mu_x^2} - \frac{\operatorname{Cov}(x, y)}{\mu_x \mu_y} \biggr) + \frac{1}{n^2}\biggl(\frac{(\operatorname{Cov}(x^2, y) -2\mu_x \operatorname{Cov}(x, y))}{\mu_x^2 \mu_y}\\
        &\phantom{asdfas}-\frac{(\operatorname{Cov}(x^2, x) - 2\mu_x \operatorname{Var}(x))}{\mu_x^3} - \frac{3\operatorname{Var}(x) \operatorname{Cov}(x, y)}{\mu_x^3 \mu_y}+\frac{3 \operatorname{Var}(x)^2}{\mu_x^4}\biggr)\Biggr)
\end{align}
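As a quick sanity check (an illustration, not part of the derivation), consider the degenerate case $y = c\,x$ for a constant $c > 0$. Then $\hat{r} = \bar{y}/\bar{x} = c$ exactly, so the bias must vanish. Indeed, with $\mu_y = c \mu_x$, $\operatorname{Cov}(x, y) = c \operatorname{Var}(x)$, and $\operatorname{Cov}(x^2, y) = c \operatorname{Cov}(x^2, x)$, the bias terms cancel in pairs:
\begin{align*}
    \frac{\operatorname{Var}(x)}{\mu_x^2} - \frac{\operatorname{Cov}(x, y)}{\mu_x \mu_y} &= 0, \\
    \frac{\operatorname{Cov}(x^2, y) - 2\mu_x \operatorname{Cov}(x, y)}{\mu_x^2 \mu_y} - \frac{\operatorname{Cov}(x^2, x) - 2\mu_x \operatorname{Var}(x)}{\mu_x^3} &= 0, \\
    - \frac{3\operatorname{Var}(x) \operatorname{Cov}(x, y)}{\mu_x^3 \mu_y} + \frac{3 \operatorname{Var}(x)^2}{\mu_x^4} &= 0,
\end{align*}
so that $r_{corr,2} = \hat{r} = c$ in this case.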
Finally, we use plug-in estimators for empirical estimation:
\begin{align}
    \begin{split}
        \hat{r}_{corr,2} &:=  \frac{\widehat{\mu_y}}{\widehat{\mu_x}}\Biggl(1-\frac{1}{n}\biggl(r_{b}^{*} - r_{a}^{*} \biggr) - \frac{1}{n^2}\biggl(\frac{(\widehat{\operatorname{Cov}(x^2, y)} -2\widehat{\mu_x} \widehat{\operatorname{Cov}(x, y)})}{\widehat{\mu_x^2} \widehat{\mu_y}}\\
        &\phantom{asdfa}-\frac{(\widehat{\operatorname{Cov}(x^2, x)} - 2\widehat{\mu_x} \widehat{\operatorname{Var}(x)})}{\widehat{\mu_x^3}} - \frac{3\widehat{\operatorname{Var}(x)} \widehat{\operatorname{Cov}(x, y)}}{\widehat{\mu_x^3} \widehat{\mu_y}}+\frac{3 \widehat{\operatorname{Var}(x)}^2}{\widehat{\mu_x^4}}\biggr)\Biggr)
    \end{split}
    \label{eq:L1_corrected_full}
\end{align}
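where $r_{a}^{*}$ and $r_{b}^{*}$ are corrected versions of the plug-in terms $r_a$ and $r_b$, given by
\begin{align}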
\begin{split}
    r_{a}^{*} &= \underbrace{\frac{\widehat{\operatorname{Cov}(x, y)}}{\widehat{\mu_x} \widehat{\mu_y}}}_{=r_a}
    \Biggl(1+\frac{1}{(n-1)}\biggl(\frac{\widehat{\mu_y}\widehat{\operatorname{Cov}(x^2, y)}+\widehat{\mu_x}\widehat{\operatorname{Cov}(y^2, x)}}{\widehat{\operatorname{Cov}(x, y)}\widehat{\mu_x}\widehat{ \mu_y}}-4\biggr)\\
    &\phantom{asasasasddasdf}- \frac{1}{(n-1)}\biggl(\frac{\widehat{\operatorname{Var}(x)}}{\widehat{\mu_x^2} }+ \frac{ \widehat{\operatorname{Var}(y)}}{ \widehat{\mu_y^2}} + 2\frac{ \widehat{\operatorname{Cov}(x, y)}}{\widehat{\mu_x} \widehat{\mu_y}}\biggr)\Biggr)
\end{split}
\label{eq:r_a_corrected_}
\end{align}

\begin{align}\label{eq:r_b_corrected_}
    r_{b}^{*} = \underbrace{\frac{\widehat{\operatorname{Var}(x)}}{\widehat{\mu_x^2}}}_{=r_b}\Biggl(1+ \frac{4}{(n-1)}\biggl(\frac{\frac{1}{2}\widehat{\operatorname{Cov}(x^2, x)}}{\widehat{\mu_x}\widehat{ \operatorname{Var}(x)}}-1\biggr)-\frac{4}{(n-1)}\frac{\widehat{\operatorname{Var}(x)}}{\widehat{\mu_x^2}}\Biggr).
\end{align}



\section{Related Work}
\label{sec:app_related_work}

\textbf{Ratio-based biomarkers} are quantitative metrics that express the relative size, volume, or intensity of a target anatomical structure as a proportion of a reference region (Fig.~\ref{fig:fig1}).
They are widely used across clinical domains to capture compositional, structural, and functional changes, enabling standardized assessment of disease progression and treatment response. Examples include ejection fraction, the fraction of blood ejected from the ventricle during each cardiac cycle; coronary artery stenosis, the percent narrowing of a coronary vessel; and fat fraction, the proportion of fat within an organ such as the liver or kidney. Ratio-based biomarkers are particularly valuable for detailed tumor characterization.
Key metrics include the necrosis-to-tumor ratio (NTR) and the core-to-tumor ratio (CTR), which quantify the internal structure of the tumor, as well as the tumor invasion rate, which reflects the extent of tumor infiltration into surrounding tissues. In summary, ratio-based measures offer standardized, comparable metrics that can be applied across imaging modalities, organs, and disease contexts.

Typically, clinicians compute these ratios using volumetric information from imaging data (\textit{e.g.}, MRI) \citep{henker2019volumetric,henker2017volumetric}. 
With the advancement of computational pathology and the growing availability of annotated medical data, recent studies \citep{ye2023automated} have developed AI-based workflows for automated ratio assessment. These methods offer scalable and consistent evaluations, effectively overcoming the limitations of subjective human judgment in manual assessments.
Despite promising developments, existing methods typically provide only point estimates \citep{ho2020deep}, neglecting the associated uncertainty. Although intuitive, ratios computed from the outputs of segmentation networks inherit the known overconfidence of neural networks \citep{guo2017calibration}. As a result, naïve ratio estimates from miscalibrated outputs often deviate systematically from the true values. Current research predominantly focuses on improving network calibration and segmentation accuracy \citep{rousseau2025post,wang2023calibrating,mehrtash2020confidence, wang2022personalizing,hatamizadeh2021swin}, while overlooking the downstream task of biomarker estimation. 
Our work addresses this gap by proposing a confidence-aware framework for ratio estimation from segmentation models. 


\end{document}
