\section{Introduction}

Regression neural networks, which predict continuous quantities, have been a major focus of research in 
many areas including  medical diagnosis \cite{leibig2017leveraging}, weather forecasting \cite{scher2018predicting} and autonomous driving \cite{carvalho2015automated}. 
Regression neural networks are applied in
medical image analysis to measure the size of pathological lesions, the size of anatomical parts, and the distances between them. One example is
ultrasound-based automatic estimation of fetal biometry which is used to assess the growth and well-being of the fetus \cite{avisdris2022}.
Another example is estimating the bone age of pediatric patients based on radiographs of their hand 
 \cite{halabi2019rsna}.

The performance of neural network systems has improved dramatically in recent years. However, for safety-critical embodied applications, accurate prediction alone is not sufficient. Uncertainty estimates are important in a wide range of applications,
and reporting the confidence of a prediction is essential for reliable and interpretable models.
One widely adopted approach to conveying uncertainty is confidence intervals, which enclose the ``true value'' with a specified probability. Ideally, these intervals are short and their length reflects the difficulty of the individual case.

Standard regression networks are trained by minimizing the Mean Squared Error (MSE). These networks
provide prediction intervals that have the same length
for all test examples, and thus
cannot directly report an instance-based confidence interval. However, it is much more informative to provide larger confidence intervals for difficult examples and
smaller ones for examples that are easier to predict.
An alternative to the MSE approach is to predict the mean and variance simultaneously. The training loss is then the negative Gaussian log-likelihood. During the testing phase, assuming a Gaussian distribution, the predicted mean and variance values can be translated into an instance-based confidence interval. \textcolor{black}{Other methods that quantify uncertainty in terms of a confidence interval include Bayesian learning \cite{sheridan2012three}, quantile regression \cite{koenker1978regression} and ensemble-based methods \cite{gal2016dropout,lakshminarayanan2017simple}, which incur a high computational cost because they require multiple model inferences. Recent studies have compared these methods, and no single method has emerged as consistently superior across all evaluation metrics and tasks (see e.g. \cite{lanini2024unique,kato2023review}).}

Consider a regression model that reports confidence intervals with a claimed coverage of $1-\alpha$. If a fraction $1-\alpha$ of the intervals indeed contain the true value, the model is called calibrated.
Deep networks are often poorly calibrated and are known to produce unreliable confidence information \cite{Guo2017}. Several recent measures are available for quantifying the calibration of a regression network, such as the Expected Normalized Calibration Error (ENCE) \cite{levi2022} and the Uncertainty Calibration Error (UCE) \cite{laves2020well,kuppers2022}. Several studies \cite{laves2020well,levi2022,Lior2023} have proposed a simple calibration method that scales the predictive variance by optimizing a likelihood criterion on a validation set. However, these calibration methods explicitly assume that the conditional density of the correct value given the image is Gaussian, which is not always the case.

 
 Conformal Prediction (CP) \cite{vovk2005conformal,angelopoulos2021gentle} is a general non-parametric calibration method
 which, given a confidence level, builds a confidence interval such that the probability that it contains the correct value is at least that level.
The Conformalized Quantile Regression (CQR) algorithm \cite{cqr2019} is a calibrated regression method that directly finds a confidence interval without any parametric assumption of the prediction distribution.
It consists of a Quantile Regression (QR) \cite{koenker1978regression} followed by a conformalization step.
On the one hand, CQR is more adaptive to heteroscedasticity and outliers than Gaussian regression.
On the other hand, the pinball loss, which is used to train the quantile regression network, is less reliable and stable than the Gaussian loss.


CQR is considered the state-of-the-art CP-calibration method for obtaining a calibrated instance-based confidence interval for a prediction made by a neural network. However, we are not aware of any comparative research on CQR performance on either medical or non-medical data (see, e.g., the discussion in a recent review paper \cite{kato2023review}).
In the current study, we first analyze and demonstrate the advantages and disadvantages of the current methods. We then propose a calibration strategy based on training a Gaussian network followed by a parameter-free CP calibration of the computed variance. We show that this strategy improves the confidence calibration procedure. We report extensive experiments on several medical imaging regression tasks and network architectures that support this combined training and calibration procedure.



\section{Calibrating Regression Networks}
In this section we review parametric and non-parametric methods for
instance-dependent calibration of a regression network, which are related to the proposed method.
%\subsection{Gaussian Variance Scaling (Gaussian-VS) } 
Consider a regression network that outputs mean $\hat{y}=\mu(x)$ and variance $\sigma^2(x)$ for each input image $x$.
The mean represents the value predicted by the network, while the variance is the level of uncertainty in the prediction.
The network output can be viewed as a Gaussian distribution in the form of  $y|x \sim \mathcal{N} (\mu(x),\sigma^2(x))$.
Given  labeled training data $(x_1,y_1),...,(x_n,y_n)$,
the network is trained by minimizing the loss function:
\begin{equation}
L(\theta) = -\sum_{t=1}^n \log \mathcal{N}(y_t; \mu_{\theta}(x_t),\sigma_{\theta}^2(x_t))
\label{nll}
\end{equation}
where $\theta$ is the network parameter set.
From the distribution $y|x \sim \mathcal{N}( \mu(x),\sigma^2(x))$ we can extract a confidence interval. Define
$ \varphi(a)=\int_{-a}^{a} f(z)\, dz $,
where $f(z)$ is the density function of $z\sim \mathcal{N}(0,1)$, so that $\varphi(a)=p(|z|\le a)$.
For example,  a  90\% confidence interval for the prediction $\mu(x)$ is 
defined by:
\begin{equation}
%p( y \in [ \mu(x)-\varphi^{-1}(0.9)\cdot \sigma(x),  \mu(x)+\varphi^{-1}(0.9)\cdot \sigma(x) ] |x) = 0.9. \\ \\
\{y \, |\, {|y - \mu(x)|}/{\sigma(x)} \le  \varphi^{-1}(0.9)   \}.
\label{vartocon}
\end{equation}
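As a concrete illustration of this conversion (standard normal values, not results from our experiments), note that $\varphi(a)=2\Phi(a)-1$, where $\Phi$ denotes the standard normal cumulative distribution function (a symbol introduced here only for this example). Hence
$$
\varphi^{-1}(p)=\Phi^{-1}\left(\frac{1+p}{2}\right), \qquad \varphi^{-1}(0.9)\approx 1.645, \qquad \varphi^{-1}(0.95)\approx 1.960 .
$$
Under the Gaussian assumption, the 90\% interval in (\ref{vartocon}) is therefore approximately $\mu(x)\pm 1.645\,\sigma(x)$.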
Since the variance $\sigma^2(x)$ is predicted by the neural network, it may not be well calibrated and may be either underestimated or overestimated.
Moreover, the training loss function (\ref{nll}) assumes a Gaussian distribution, which may not be correct, so the conversion from variance to confidence interval (\ref{vartocon}) can be wrong.
Gaussian Variance Scaling (Gaussian-VS) \cite{laves2020well,levi2022,Lior2023}
is a method for calibrating the predicted standard deviation $\sigma(x)$ so that it yields a meaningful confidence interval. This method is based on a Gaussian assumption on the conditional density $f(y|x)$.
Gaussian-VS
computes a scalar $r$ that scales the standard deviation predicted by the network: $\sigma(x) \rightarrow r\cdot\sigma(x)$.
Given labeled validation data $(x_1,y_1),...,(x_n,y_n)$, we look for a scalar $r$ that minimizes the loss:
\begin{equation}
L(r) = -\sum_{t=1}^n \log \mathcal{N}( y_t;\mu_{\theta}(x_t),r^2\sigma^2_{\theta}(x_t)).
\label{lc}
\end{equation}
It is easy to verify that the optimal $r$ is:
 \begin{equation}
\hat{r}^2 = \frac{1}{n} \sum_{t=1}^n s_t^2 
\hspace{0.8cm} \mbox{where} \hspace{0.8cm}
   s_t =  \frac{|y_t - \mu(x_t)|}{\sigma(x_t)}. 
   \label{mlc}
\end{equation}
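For completeness, the closed form can be sketched as follows. Up to terms that do not depend on $r$, the loss (\ref{lc}) can be written as
$$
L(r) = n \log r + \frac{1}{2r^2} \sum_{t=1}^n s_t^2 + \mathrm{const},
$$
and setting its derivative to zero gives
$$
\frac{\partial L}{\partial r} = \frac{n}{r} - \frac{1}{r^3} \sum_{t=1}^n s_t^2 = 0
\quad \Longrightarrow \quad
r^2 = \frac{1}{n} \sum_{t=1}^n s_t^2 ,
$$
which is (\ref{mlc}).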
Given a test image $x$, the calibrated  confidence interval with coverage $1\!-\!\alpha$ is:
\begin{equation}
    [ \mu(x)-\varphi^{-1}(1\!-\!\alpha) \cdot \hat{r} \cdot \sigma(x),  \mu(x)+\varphi^{-1} (1\!-\!\alpha) \cdot \hat{r} \cdot \sigma(x) ].
\end{equation}


Neural networks tend to be very robust to model mismatch and inaccurate ground truth measurements due to their high non-linearity and over-parameterization. In contrast, Gaussian-VS calibration is linear and consists of a single parameter. Thus, if $f(y|x)$ is unimodal but not Gaussian, network training may still work well while Gaussian-VS fails to produce an accurate confidence interval.


%\subsection{Conformal Quantile Regression (CQR)}
 
 The CQR  calibration approach \cite{cqr2019} consists of a non-parametric  Quantile Regression (QR) \cite{koenker1978regression} followed by a conformalization step. Define the $\gamma$-quantile (pinball) loss:
$$
L_{\gamma} (y,t) =
1_{\{t<y\}} (y -  t) \gamma +
1_{\{t>y\}}( t - y ) (1-\gamma). 
            $$
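To illustrate the asymmetry of this loss, consider a small example (numbers chosen only for illustration) with $\gamma=0.95$ and $y=1$. An estimate that is too low, $t=0.8$, and one that is too high, $t=1.2$, give
$$
L_{0.95}(1,0.8) = 0.95\cdot 0.2 = 0.19, \qquad
L_{0.95}(1,1.2) = 0.05\cdot 0.2 = 0.01 .
$$
Underestimation is thus penalized 19 times more heavily than overestimation, which pushes the minimizer towards the $0.95$-quantile of $y$.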
Given a training set  $(x_1,y_1),\dots,(x_n,y_n)$ the QR algorithm trains a  $\gamma$-quantile estimation $\hat{t}_{\gamma}(x)$  using the pinball loss:
$$
L_{\gamma}(\theta) = \frac{1}{n} \sum_{i=1}^n L_{\gamma} (y_i,\hat{t}_{\gamma}(x_i,\theta)),
$$
such that $\theta$  is the network parameter set.
Given a miscoverage rate $\alpha$, we train a QR network with two heads   $\hat{t}_{\alpha/2}(x)$  and   $\hat{t}_{1-\alpha/2}(x)$   to obtain an instance-dependent  confidence interval:
$ [ \hat{t}_{\alpha/2}(x),  \hat{t}_{1-\alpha/2}(x)]$.
The CQR algorithm applies a CP procedure to ensure that the coverage is indeed $1-\alpha$.
Define the following conformal score:
$$
s(x,y) = \max \{ \hat{t}_{\alpha/2}(x)-y,  y-\hat{t}_{1-\alpha/2}(x)\}.
$$
%The score $s(x,y)$ is the minimum number $q$ such that $y\in  [ \hat{t}_{\alpha/2}(x)-q,  \hat{t}_{1-\alpha/2}(x)+q)]$.
Let $s_1,\dots,s_n$ be the conformal scores of a given validation set $(x_1,y_1),\dots,(x_n,y_n)$ and
let ${q}$ be the $\frac{\lceil(n+1)(1-\alpha)\rceil}{n}$ quantile of $s_1,\dots,s_n$.
The calibrated confidence interval is: $$ C_{{q}}(x)=[ \hat{t}_{\alpha/2}(x)-{q},  \hat{t}_{1-\alpha/2}(x)+{q}].$$
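As a small illustrative example (numbers chosen only for illustration), suppose the QR interval for a validation point is $[\hat{t}_{\alpha/2}(x), \hat{t}_{1-\alpha/2}(x)] = [0.3, 0.6]$. Then
$$
y=0.65:\;\; s(x,y)=\max\{0.3-0.65,\; 0.65-0.6\}=0.05, \qquad
y=0.40:\;\; s(x,y)=\max\{0.3-0.4,\; 0.4-0.6\}=-0.1 .
$$
The score is positive when $y$ falls outside the QR interval and negative when it falls inside, so the correction $q$ can either widen ($q>0$) or shrink ($q<0$) the interval.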

Unlike Gaussian-VS, the interval obtained by CQR has a coverage guarantee. The CP theory \cite{vovk2005conformal} guarantees that $1\!-\!\alpha \le p( y\in C_{{q}}(x)) \le 1-\alpha +\frac{1}{n+1}$,
where $y$ is the (unknown) true label and the upper bound holds when the conformal scores are almost surely distinct.
\textcolor{black}{Note that this is a marginal
probability over all possible test points, and coverage may be worse or better for some
cases. It can be proved that exact distribution-free conditional coverage is, in general, impossible \cite{foygel2021limits}.}
%Note that this is a marginal probability over all possible test points and is not conditioned on a given input.
QR is much more difficult to train than a Gaussian regression network, and when QR produces poor interval estimates,
the performance of CQR also suffers, since the conformalization step restores validity at the expense of efficiency, i.e., by widening the intervals \cite{chung2021,kato2023review}.






\section{CP-based Variance Scaling} 

In this section we present a method that combines the benefits of the two methods described above.  
We first train a parametric Gaussian network to predict the target and its variance and then apply a
non-parametric CP to calibrate the estimated
variance. Assume we trained a regression network that outputs mean $\hat{y}=\mu(x)$ and variance $\sigma^2(x)$ for each input image $x$ as described in Section 2.
Given a coverage level $1-\alpha$, we can apply CP
to scale the standard deviation and find a confidence interval around $\mu(x)$ in the form of:
\begin{equation}
C_q(x)= [\mu(x)-q \cdot \sigma(x),\mu(x)+q \cdot \sigma(x) ]
\end{equation}
such that the true value $y$ is within this interval with probability at least $1-\alpha$, i.e., $p(y\in C_q(x))\ge 1-\alpha.$
The calibration parameter $q$ is found in the following way.
For each labeled pair $(x,y)$ define the conformal score:
 $s =  {|y-\mu(x)|}/{\sigma(x)}.$
 It is easy to verify that:
$$ C_s(x) =   [\mu(x) - |y-\mu(x)|,\mu(x) +|y-\mu(x)|]$$
and $y\in C_q(x)$ if and only if $q \ge s$. In other words, $C_s(x)$ is the minimal interval centered at $\mu(x)$ that contains the true value $y$. Let $s_1,\dots,s_n$ be the conformal scores of the validation set $(x_1,y_1),\dots,(x_n,y_n)$. The calibration value $q$ computed by the CP algorithm is the
 $\frac{\lceil(n+1)(1-\alpha)\rceil}{n}$ quantile of $s_1,\dots,s_n$.
 In other words, $q$ is the minimal value for which the true value lies within the confidence interval defined by $q$ for at least $\lceil(n+1)(1-\alpha)\rceil$ of the $n$ validation points.
  The CP theory \cite{vovk2005conformal}  guarantees that regardless of the data distribution, for  test data $(x,y)$, the value $q$ found by the CP algorithm satisfies:
\begin{equation}
1-\alpha \le p(y\in C_q(x)) \le 1-\alpha +\frac{1}{n+1}
\label{cptheory}
\end{equation}
such that $n$ is the size of the validation set.
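As a worked example of the quantile rule, take a validation set of size $n=2000$ (the size of the BoneAge validation set described in the experiments). Then
$$
\alpha=0.1:\;\; \lceil (n+1)(1-\alpha) \rceil = \lceil 1800.9 \rceil = 1801, \qquad
\alpha=0.05:\;\; \lceil (n+1)(1-\alpha) \rceil = \lceil 1900.95 \rceil = 1901,
$$
so $q$ is the 1801st (respectively 1901st) smallest of the 2000 scores, slightly above the plain empirical $(1-\alpha)$ quantile. If $\lceil (n+1)(1-\alpha) \rceil > n$, which happens only for very small validation sets, the usual convention is to set $q=\infty$.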

 

\begin{algorithm}[t]
\begin{algorithmic}
 \caption{ Conformal-Prediction  based Variance Scaling (CP-VS) } \label{alg1}
\State   \textbf{input:} A labeled dataset divided into training and validation subsets and a confidence level $1\!-\!\alpha$.
\State - Train a regression network $x \rightarrow (\mu_{\theta}(x),\sigma_{\theta}^2(x))$ by minimizing the loss
$$
L(\theta) = -\sum_{t=1}^n \log \mathcal{N}(y_t; \mu_{\theta}(x_t),\sigma_{\theta}^2(x_t))
$$
\State  - Compute the conformal scores on the validation set:
 $$s_t =   {|y_t-\mu(x_t)|}/{\sigma(x_t)}, \hspace{8mm} t=1,...,n$$
 \State - Sort the scores $s_1 \le s_2 \le \dots \le s_n$ and set $q = s_{\lceil (n+1)(1\!-\!\alpha)\rceil}$.
\State - The confidence interval of a new test point $x$ is:
  $C(x)=[\mu(x)-q \cdot \sigma(x),\mu(x)+q \cdot \sigma(x) ].$
%\vspace{0.2cm}
\State - There is a marginal coverage guarantee: $p( y\in C(x)) \ge 1-\alpha$.
   \end{algorithmic}
\end{algorithm}

Note that both Gaussian-VS and CP-VS calibrate by scaling the estimated standard deviation $\sigma(x)$ using a scalar value that is learned from the same conformal scores $s_1,...,s_n$ (\ref{mlc}) obtained from the validation set. Gaussian-VS is based on the root mean square of the scores, while CP-VS is based on a quantile of the scores.
If the conditional density of the target value given the input image is indeed Gaussian, the two calibration methods asymptotically coincide. If the conditional density is not Gaussian, the quantile is a more effective
measure than the mean when constructing a confidence interval. The CP-VS algorithm is summarized in Algorithm~\ref{alg1}.
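The asymptotic agreement of the two methods can be sketched as follows, under the idealized assumption (made here only for illustration) that $(y-\mu(x))/\sigma(x) \sim \mathcal{N}(0,r_0^2)$ for some constant $r_0>0$. Then $s_t = r_0 |z_t|$ with $z_t\sim\mathcal{N}(0,1)$, and as $n\rightarrow\infty$
$$
\hat{r}^2 = \frac{1}{n}\sum_{t=1}^n s_t^2 \rightarrow r_0^2 \, E[z^2] = r_0^2,
\qquad
q \rightarrow a_{\alpha}\, r_0 ,
$$
where $a_{\alpha}$ is the value satisfying $\varphi(a_{\alpha})=1-\alpha$, with $\varphi$ as defined in Section 2, since $p(r_0|z|\le a) = \varphi(a/r_0)$. Both methods therefore converge to the same half-width $a_{\alpha}\, r_0\,\sigma(x)$.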

CP-VS has two main benefits. First, network training is conducted using the robust Gaussian loss (unlike CQR, which uses the pinball loss). Second, CP-VS achieves calibration through the CP procedure, which has a parameter-free theoretical coverage guarantee (unlike Gaussian-VS, which has neither a theoretical nor a practical coverage guarantee).

 

\begin{table*}[t]
	\caption {Calibration results measured by average confidence interval length and coverage (\%). The method that reports the minimal length among those that have a coverage guarantee is shown in bold.  }
	\label{table:ece_tab1}
   	 \centering
    \resizebox{\textwidth}{!}{
		\begin{tabular}{ll|cc|cc|cc}
			\toprule 
       \multicolumn{2}{c|}{$1-\alpha=0.9$ } &  \multicolumn{2}{c|}{Gaussian-VS } & \multicolumn{2}{c|}{CQR} 
       & \multicolumn{2}{c}{CP-VS}  
       \\
			{\small Dataset} & {\small Architecture} & 
                  length $\downarrow$ & coverage &  length $\downarrow$ &  coverage & length $\downarrow$ &  coverage\\
				\midrule
						\multirow{2}{*}{BoneAge} 
			&
			DenseNet-201 &  {0.184 $\pm$ 0.003} & 91.10 $\pm$ 0.71 &
                                   0.411 $\pm$ 0.005 & 90.02 $\pm$ 0.78  &
                                  {\textbf{0.176} $\pm$ 0.003} & {89.92} $\pm$ {0.85} 
                              \\
			&
		      EfficientNet-B4  & {{0.190} $\pm$ 0.003} & {90.19} $\pm$ {0.56} & 
                                    0.602 $\pm$ 0.007 & 90.02 $\pm$ 0.83 &   
                                        {\textbf{0.189} $\pm$ 0.003} & {89.95} $\pm$ {0.72}  \\
	
			\midrule
			\multirow{2}{*}{OCT} 
			&
			DenseNet-201 & 0.128 $\pm$ 0.001 & 99.94 $\pm$ 0.06 &
                                {0.106 $\pm$ 0.002} & 90.11 $\pm$ 0.99 &   
                               {\textbf{0.048} $\pm$ 0.001} & {90.47} $\pm$ {1.32}
                               \\
			&
			EfficientNet-B4 &  {0.127 $\pm$ 0.001} & 99.93 $\pm$ 0.06 &
                                   0.157 $\pm$ 0.003 & 89.97 $\pm$ 1.50 &
                                     {\textbf{0.050} $\pm$ 0.001} & {90.07} $\pm$ {1.80}  
                                 \\ 
                \midrule
			\multirow{2}{*}{\textcolor{black}{Brain}} 
			&
			DenseNet-201 & 0.316 $\pm$ 0.002 & 83.30 $\pm$ 0.54 &
                                {0.581 $\pm$ 0.002} & 89.80 $\pm$ 0.43 &   
                               {\textbf{0.371} $\pm$ 0.004} & {90.01} $\pm$ {0.56}
                               \\
			&
			EfficientNet-B4 &  {0.411 $\pm$ 0.002} & 90.84 $\pm$ 0.34 &
                                   0.631 $\pm$ 0.007 & 90.01 $\pm$ 0.47 &
                                     {\textbf{0.396} $\pm$ 0.004} & {90.01} $\pm$ {0.54}  
                                 \\ 
                \midrule
                
                \multirow{2}{*}{DLS1}
                & DenseNet-201 & 0.102 $\pm$ 0.004 & 93.55 $\pm$ 1.32 & 0.278 $\pm$ 0.003 & 89.68 $\pm$ 1.01 & \textbf{0.092} $\pm$ 0.001 & {90.33} $\pm$ {1.04} \\ 
                & EfficientNet-B4 & 0.086 $\pm$ 0.008 & 96.11 $\pm$ 2.21 & 0.350 $\pm$ 0.004 & 90.37 $\pm$ 1.16 & \textbf{0.068} $\pm$ 0.001 & {89.87} $\pm$ {0.98} \\ 


                \midrule
                \multirow{2}{*}{DLS2}
                & DenseNet-201 & 0.063 $\pm$ 0.001 & 91.25 $\pm$ 0.72 & 0.316 $\pm$ 0.002 & 90.36 $\pm$ 0.89 & \textbf{0.060} $\pm$ 0.001 & {89.79} $\pm$ {0.87} \\ 
                & EfficientNet-B4 & 0.071 $\pm$ 0.001 & 92.56 $\pm$ 0.68 & 0.299 $\pm$ 0.002 & 89.93 $\pm$ 0.78 & \textbf{0.064} $\pm$ 0.001 & {89.94} $\pm$ {1.13} \\ 


                \midrule
                \multirow{2}{*}{DLS3}
                & DenseNet-201 & 0.157 $\pm$ 0.004 & 91.53 $\pm$ 1.02 & {0.178} $\pm$ 0.002 & 90.18 $\pm$ 0.90 & \textbf{0.151} $\pm$ 0.003 & {90.18} $\pm$ {0.88} \\ 
                & EfficientNet-B4 & 0.076 $\pm$ 0.005 & 94.41 $\pm$ 1.85 & {0.299} $\pm$ 0.002 & 89.96 $\pm$ 0.78 & \textbf{0.065} $\pm$ 0.001 & {90.12} $\pm$ {1.06} \\ 


                \midrule
                \multirow{2}{*}{DLS4}
                & DenseNet-201 & 0.093 $\pm$ 0.006 & 93.30 $\pm$ 1.70 & 0.220 $\pm$ 0.002 & 89.92 $\pm$ 0.99 & \textbf{0.082} $\pm$ 0.001 & {89.86} $\pm$ {1.08} \\ 
                & EfficientNet-B4 & 0.077 $\pm$ 0.004 & 94.07 $\pm$ 1.74 & 0.328 $\pm$ 0.005 & 90.21 $\pm$ 0.93 & \textbf{0.069} $\pm$ 0.001 & {90.22} $\pm$ {0.84} \\ 

                \midrule
                \multirow{2}{*}{DLS5}
                & DenseNet-201 & 0.115 $\pm$ 0.001 & {90.45} $\pm$ {0.73} & 0.227 $\pm$ 0.003 & 90.19 $\pm$ 0.89 & \textbf{0.114} $\pm$ 0.002 & {89.99} $\pm$ {1.09} \\ 
                & EfficientNet-B4 & 0.066 $\pm$ 0.002 & 92.96 $\pm$ 0.84 & 0.474 $\pm$ 0.011 & 89.75 $\pm$ 1.14 & \textbf{0.058} $\pm$ 0.002 & {89.70} $\pm$ {1.27} \\ 



            \bottomrule
		\end{tabular} 
       } \\
 
        \vspace{0.3cm}
 
     
   	 \centering
    \resizebox{\textwidth}{!}{
		\begin{tabular}{ll|cc|cc|cc}
			\toprule 
      \multicolumn{2}{c|}{$1-\alpha=0.95$ } &  \multicolumn{2}{c|}{Gaussian-VS } & \multicolumn{2}{c|}{CQR} 
       & \multicolumn{2}{c}{CP-VS} \\ 
			{\small Dataset} & {\small Architecture} & 
                  length $\downarrow$ & coverage &  length $\downarrow$ &  coverage  &  length $\downarrow$ &  coverage\\
				\midrule
					\multirow{2}{*}{BoneAge} 
			&
			DenseNet-201 &  {0.219 $\pm$ 0.004} & {95.40} $\pm$ {0.43}
                                & 0.539 $\pm$ 0.011 & {94.97} $\pm$ {0.87} 
                                & \textbf{0.224} $\pm$ 0.004 & {94.91} $\pm$ {0.53} \\
			&
		      EfficientNet-B4  &  {0.227 $\pm$ 0.003} & {94.71} $\pm$ {0.43} &
                                   0.489 $\pm$ 0.008 & 95.08 $\pm$ 0.58 
                                  &  \textbf{0.233} $\pm$ 0.004 & {95.26} $\pm$ {0.47} \\
				\midrule
			\multirow{2}{*}{OCT} 
			&
			DenseNet-201 &  0.153 $\pm$ 0.001 & 99.94 $\pm$ 0.06  
                                & {0.146 $\pm$ 0.005} & 95.07 $\pm$ 1.20
                                & \textbf{0.057} $\pm$ 0.001 & {94.96} $\pm$ {0.87} \\
			&
			EfficientNet-B4 & {0.152 $\pm$ 0.002} & 99.98 $\pm$ 0.02 &
                                0.235 $\pm$ 0.006 & 95.12 $\pm$ 0.69 & 
                                   \textbf{0.059} $\pm$ 0.001 & {94.99} $\pm$ {1.06} \\
                                 \midrule
			\multirow{2}{*}{\textcolor{black}{Brain}} 
			&
			DenseNet-201 & 0.377 $\pm$ 0.001 & 89.78 $\pm$ 0.30 &
                                {0.786 $\pm$ 0.000} & 95.68 $\pm$ 0.16 &   
                               {\textbf{0.518} $\pm$ 0.009} & {95.04} $\pm${0.44}
                               \\
			&
			EfficientNet-B4 &  {0.380 $\pm$ 0.002} & 91.15 $\pm$ 0.34 &
                                   0.754 $\pm$ 0.008 & 94.97 $\pm$ 0.41 &
                                     {\textbf{0.498} $\pm$ 0.010} & {94.96} $\pm$ {0.38}  
                                 \\ 
                \midrule
                \multirow{2}{*}{DLS1}
                & DenseNet-201 & 0.121 $\pm$ 0.005 & 96.97 $\pm$ 0.73 & 0.362 $\pm$ 0.004 & 95.09 $\pm$ 0.67 & \textbf{0.110} $\pm$ 0.002 & {95.35} $\pm$ {0.67} \\ 
                & EfficientNet-B4 & 0.102 $\pm$ 0.010 & 98.38 $\pm$ 1.10 & 0.425 $\pm$ 0.004 & 95.20 $\pm$ 0.65 & \textbf{0.080} $\pm$ 0.001 & {94.87} $\pm$ {0.63} \\ 

                \midrule

                \multirow{2}{*}{DLS2}
                & DenseNet-201 & 0.075 $\pm$ 0.001 & 96.02 $\pm$ 0.44 & 0.352 $\pm$ 0.003 & 95.14 $\pm$ 0.71 & \textbf{0.071} $\pm$ 0.001 & {94.81} $\pm$ {0.56} \\ 
                & EfficientNet-B4 & 0.084 $\pm$ 0.001 & 95.52 $\pm$ 0.43 & 0.379 $\pm$ 0.003 & 94.88 $\pm$ 0.58 & \textbf{0.081} $\pm$ 0.002 & {94.92} $\pm$ {0.61} \\ 

                \midrule
                \multirow{2}{*}{DLS3}
                & DenseNet-201 & 0.183 $\pm$ 0.005 & 95.74 $\pm$ 0.63 & 0.172 $\pm$ 0.003 & 95.17 $\pm$ 0.78 & \textbf{0.179} $\pm$ 0.003 & {95.24} $\pm$ {0.59} \\ 
                & EfficientNet-B4 & 0.090 $\pm$ 0.007 & 97.74 $\pm$ 1.13 & 0.432 $\pm$ 0.006 & 95.06 $\pm$ 0.47 & \textbf{0.078} $\pm$ 0.001 & {95.15} $\pm$ {0.63} \\ 


                \midrule
                \multirow{2}{*}{DLS4}
                & DenseNet-201 & 0.111 $\pm$ 0.007 & 96.70 $\pm$ 1.02 & 0.251 $\pm$ 0.003 & 95.05 $\pm$ 0.72 & \textbf{0.100} $\pm$ 0.002 & {94.96} $\pm$ {0.82} \\ 
                & EfficientNet-B4 & 0.092 $\pm$ 0.005 & 97.73 $\pm$ 0.84 & 0.363 $\pm$ 0.008 & 94.97 $\pm$ 0.63 & \textbf{0.080} $\pm$ 0.001 & {95.02} $\pm$ {0.73} \\ 
 
                \midrule
                \multirow{2}{*}{DLS5}
                & DenseNet-201 & 0.138 $\pm$ 0.001 & {95.33} $\pm$ {0.46} & 0.283 $\pm$ 0.003 & 95.21 $\pm$ 0.59 & \textbf{0.134} $\pm$ 0.002 & {94.76} $\pm$ {0.65} \\ 
                & EfficientNet-B4 & 0.079 $\pm$ 0.002 & 96.16 $\pm$ 0.54 & 0.552 $\pm$ 0.012 & 94.70 $\pm$ 0.64 & \textbf{0.073} $\pm$ 0.002 & {94.77} $\pm$ {0.68} \\ 

            \bottomrule
		\end{tabular}%
      }
    	\end{table*}
\section{Experimental Results}

In this section, we empirically compare the performance of the confidence intervals computed by CP-VS   to those computed by Gaussian-VS and CQR in terms of both interval length and coverage.


\begin{figure}[h]
\centering
    \subfigure[]
   {\includegraphics[trim = 0 200 0 200,clip,width=2.45cm]{
    6956.png}}
    \hspace{1cm}
    \subfigure[]
   {\includegraphics[width=2.45cm]{
    7246.png}}
    \hspace{1cm}
    \subfigure[]
   {\includegraphics[width=2.45cm]{
    11613.png}}
    \caption{Samples from the BoneAge test set. (a) Target is 0.47, intervals are - Gaussian-VS: [0.37,0.57], CP-VS: [0.38,0.56], CQR: [0.20,0.83]. (b) Target is 0.76, intervals are - Gaussian-VS: [0.67,0.79], CP-VS: [0.68,0.78], CQR: [0.23,0.81]. (c) Target is 0.21, intervals are - Gaussian-VS: [0.17,0.31], CP-VS: [0.17,0.31], CQR: [0.24,0.90].}
    \label{fig:boneage_test_samples}
\end{figure}


\begin{figure}[h]
\centering
    \subfigure[]
   {\includegraphics[width=3.2cm]{left.jpg}}
    \hspace{0.8cm}
    \subfigure[]
   {\includegraphics[width=3.2cm]{middle.jpg}}
    \hspace{0.8cm}
    \subfigure[]
   {\includegraphics[width=3.2cm]{right.jpg}}
    \caption{\textcolor{black}{Samples from the DLS-1 test set. Blue - true position of the lumbar, green - predicted position of the lumbar, orange - bounding box created by CP-VS, pink - bounding box created by CQR.}}
    \label{fig:dls-1-test}
\end{figure}



\textbf{Datasets.} We implemented the proposed calibration methods on several medical imaging regression tasks to evaluate their performance. The experimental setup follows the one used in \cite{laves2020well} and  includes the following medical datasets:
\begin{itemize}
\item 
 BoneAge - Hand radiograph age regression from the RSNA pediatric bone age dataset  \cite{halabi2019rsna}.  The task here is to infer a person's age in months from radiographs of the hand. This dataset is the largest used in this study  and has 12,811 images, from which we used 6811/2000/4000 images for training/validation/testing.
\item   OCT - Six degrees of freedom (6DoF) needle pose estimation on optical coherence tomography (OCT).  This dataset contains 5,000 3D-OCT scans with the accompanying needle pose $y \in [0,1]^6$, from which we use 3300/850/850 for training/validation/testing \cite{laves2020well}.
\item \textcolor{black}{Brain - We used the brain tumor dataset from the Medical Segmentation Decathlon \cite{Simpson2019, Antonelli2022}, which consists of 484 brain MRI scans with corresponding tumor segmentation masks. The dataset was split into training, validation, and test sets in an 80\%/20\%/20\% ratio. Each scan is a 3D volume of size $240 \times 240 \times 155$. We extracted individual slices from each MRI scan, resulting in 155 image slices of size $240 \times 240$ per scan. A regression target was assigned to each image by counting the number of labeled brain tumor pixels \cite{Gustafsson2023}.}
\item DLS - This dataset is designed to facilitate the detection and classification of degenerative lumbar spine (DLS) conditions using MRI images. Each image includes annotations for the $(x, y)$ positions of five vertebrae. %The network was trained to predict the value $\frac{x+y}{2}$.
For each vertebra, the dataset is divided into 60\% for training, 20\% for validation, and 20\% for testing, with approximately 10,000 images per vertebra \cite{rsna-2024-lumbar-spine-degenerative-classification}.


\end{itemize}
   
\textbf{Implementation details.}  The network architectures used were
EfficientNet-B4 \cite{tan2019efficientnet} and DenseNet-201  \cite{he2016deep,Guo2017}. The last
linear layer of all networks was replaced by two linear layers predicting the mean and log-variance.
The networks were trained until no further decrease of the loss on the validation set could be observed. During training, each input $x$ was passed through the network 25 times with dropout applied, resulting in variations among the outputs. The final prediction used in the loss function was computed as the average of these 25 outputs. To implement the CQR algorithm, we used the code from the CQR project 
GitHub\footnote{\url{https://github.com/yromano/cqr}}. CQR is trained to minimize the average length while keeping the coverage valid. \textcolor{black}{Note that a different QR network must be trained for each value of the threshold $1 - \alpha$. In contrast, in the case of CP-VS, the Gaussian network is trained only once, and only the CP step needs to be redone for each threshold.}



\textbf{Evaluation measures.} The standard direct way to evaluate the performance of confidence-interval estimators on a given test set $(x_1,y_1),...,(x_n,y_n)$ is by computing the degree of coverage and the average length of the prediction intervals.   Smaller average interval widths indicate higher precision.  The length and coverage are formally defined as:
$$ \textrm{length} = \frac{1}{n} \sum_{i=1}^n | C(x_i) |,
\hspace{0.2cm}  \textrm{coverage} = \frac{1}{n} \sum_{i=1}^n
\mathbf{1}(y_i \in C(x_i))$$
where $C(x_i)$ is the confidence interval of $x_i$ and $|C(x_i)|$ is its length.
The best algorithm is the one that reports the minimal average interval length among those that satisfy the coverage requirement.


\begin{figure}[]
\centering
   \subfigure[BoneAge]{\includegraphics[width=4.2cm]{boneage.pdf}}
   \hspace{0.5cm} 
   \subfigure[OCT]{\includegraphics[width=4.2cm]{oct.pdf}}
   \hspace{0.5cm} 
   \subfigure[DLS1]{\includegraphics[width=4.2cm]{lumbar.png}}
   
   \caption{Histograms of the normalized network prediction values computed on the validation sets using DenseNet-201.}
\end{figure}





\textbf{Results.} Table~\ref{table:ece_tab1} shows the comparative calibration results (length and coverage) for the three methods (Gaussian-VS, CQR and CP-VS) on the test set. The results were averaged over 20 random splits of the data into validation and test sets.
Both CP-VS and CQR, which apply a CP procedure, obtained coverage that closely matched the required level, consistent with the marginal guarantee of the CP theory.
However, the average length reported by CQR was much larger in almost all cases due to the non-robust training of the QR algorithm. The coverage rate of Gaussian-VS was inconsistent. In some cases, it was below the required coverage $(1-\alpha)$; in other cases, it was above (resulting in a large average interval). Note that Gaussian-VS has no theoretical coverage guarantee.

\textbf{Visual examples.} We next illustrate the proposed method on several examples. Figure \ref{fig:boneage_test_samples} shows examples from the BoneAge dataset. The examples illustrate the same trend that was reported in Table~\ref{table:ece_tab1}, namely that CP-VS yields the shortest confidence intervals. \textcolor{black}{Next, we illustrate results of predicting the positions of lumbar L1/L2 using the DLS-1 dataset. Two networks were trained, one for each dimension, and the CP procedure was applied to guarantee 95\% coverage for each dimension. As a result, by the union bound, at least 90\% of the bounding boxes produced by CP-VS and CQR are expected to contain the true position of the lumbar. % (Bonferroni correction  \cite{sedgwick2012multiple}).
Fig. \ref{fig:dls-1-test} presents examples of bounding boxes around lumbar position predictions computed by CP-VS and CQR on images from the test set. Notably, CP-VS produces considerably smaller bounding boxes.}
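The 90\% figure for the bounding boxes follows from a union bound over the two coordinates. Writing $C^{(1)}$ and $C^{(2)}$ for the two per-coordinate intervals and $y^{(1)}, y^{(2)}$ for the corresponding true coordinates (notation introduced here only for this remark), each with miscoverage at most $0.05$, we have
$$
p\big(y^{(1)} \in C^{(1)}(x),\; y^{(2)} \in C^{(2)}(x)\big)
\ge 1 - p\big(y^{(1)} \notin C^{(1)}(x)\big) - p\big(y^{(2)} \notin C^{(2)}(x)\big)
\ge 1 - 0.05 - 0.05 = 0.9 ,
$$
so 90\% is a lower bound on the joint coverage of the bounding box rather than an exact value.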



\textbf{Normality check. } We next analyzed whether the output distribution of the Gaussian network was indeed Gaussian for each dataset. For each image $x$ in the validation set, we computed the scalar $(y-\mu(x))/\sigma(x)$ such that $y$ was the correct value and $\mu(x)$ and $\sigma(x)$ were predicted by the network.
Note that $y \sim \mathcal{N}( \mu(x),\sigma^2(x))$ implies that  $(y-\mu(x))/\sigma(x) \sim \mathcal{N}(0,1)$.
  The histograms of the three datasets (BoneAge, OCT and DLS1) are
  shown in Fig.~3.
We also applied the Kolmogorov--Smirnov test to check the data's normality and obtained p-values of 0.018 (BoneAge), $1.42\times 10^{-60}$ (OCT), and 0.001 (DLS1).
Hence, the BoneAge task histogram was the only one that resembled a Normal distribution and passed the Gaussianity test.
We can see in Table~\ref{table:ece_tab1} that when the normal assumption is valid, Gaussian-VS works well and produces an effective confidence interval. However, when the normal assumption fails, Gaussian-VS behaves inconsistently. In some cases it does not satisfy the coverage requirement, and in other cases it yields large confidence intervals.
  

 \section{Conclusions}
 
In this study, we addressed the problem of reporting a reliable confidence interval that varies in size across the images and reflects instance-specific uncertainty with a theoretical coverage guarantee that makes it useful in real systems.  We proposed a CP-based procedure that calibrates the prediction of a Gaussian network. We showed that CP-VS produces a confidence interval whose average length is much smaller than that of CQR while maintaining the same coverage guarantee. %We showed that Gaussian-VS cannot be used in medical applications since its confidence interval has no statistical meaning. 
%We showed that CQR achieves poor results in terms of average interval length. We demonstrated that CP-VS, which is based on training a network using a Gaussian loss followed by a distribution-free calibration based on CP, achieved the best results. 
We focused here on medical imaging applications, but the conclusions are general and relevant for calibrating any regression network. \textcolor{black}{CP-based calibration algorithms (in both classification and regression setups) are not robust to real-world situations of missing data, label noise \cite{Einbinder2024} and distribution shift \cite{Gustafsson2023}. Possible future research directions include extending the proposed method in a way that allows it to handle these problems.}



