\documentclass{midl} % Include author names [anon]

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{booktabs}
\usepackage{comment}
\usepackage{adjustbox}
\usepackage{dirtytalk}
\usepackage{nicefrac}
\usepackage{float}
\usepackage{multirow}
\usepackage{listings} % For code snippets
\usepackage{xcolor}   % For coloring code
\usepackage{mathtools} % for '\DeclarePairedDelimiter' macro
\DeclarePairedDelimiter{\abs}{\lvert}{\rvert}

% Define specific colors for syntax highlighting
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

% Setup the style for the code listing
\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},   
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\footnotesize,
    breakatwhitespace=false,         
    breaklines=true,                 
    captionpos=b,                    
    keepspaces=true,                 
    numbers=left,                    
    numbersep=5pt,                  
    showspaces=false,                
    showstringspaces=false,
    showtabs=false,                  
    tabsize=2
}

\lstset{style=mystyle}


% to be uncommented for submissions under review
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\jmlrvolume{-- 317}
\editors{Accepted for publication at MIDL 2026}
\title[Beyond scalar losses: calibrating segmentation models via gradient vector field surgery]{Beyond scalar losses: calibrating segmentation models via gradient vector field surgery}


% More complicated cases, e.g. with dual affiliations and joint authorship
\midlauthor{
\Name{Laurin Lux\nametag{$^{1,2,3}$}} \Email{laurin.lux@tum.de}\\
\Name{Alexander H. Berger\nametag{$^{1,3,4}$}} \Email{a.berger@tum.de}\\
\Name{Moritz Knolle\nametag{$^{3}$}} \Email{moritz.knolle@tum.de}\\
\Name{Daniel Rückert\nametag{$^{1,2,4,6}$}} \Email{daniel.rueckert@tum.de}\\
\Name{Johannes C. Paetzold\nametag{$^{4,5}$}} \Email{jpaetzold@med.cornell.edu}\\
\addr $^{1}$ School of Computation, Information and Technology, TUM, Munich, Germany \\
\addr $^{2}$ Munich Center for Machine Learning, Munich, Germany \\
\addr $^{3}$ Department of Radiology, Weill Cornell Medicine, New York City, USA \\
\addr $^{4}$ School of Medicine and Health, TUM University Hospital, Munich, Germany \\
\addr $^{5}$ Cornell Tech, New York City, USA \\
\addr $^{6}$ Department of Computing, Imperial College London, London, UK \\
}
\begin{document}

\maketitle

\begin{abstract}
Region-based loss functions, such as the Dice loss, have established themselves as the de facto standard for highly class- and region-imbalanced segmentation tasks. However, models trained using region-based loss functions are notoriously miscalibrated and typically yield overconfident predictions. In medical imaging applications, such as defining tumor resection margins, this miscalibration hinders clinical adoption. In this work, we outline a novel gradient perspective on this overconfidence and show how it affects region-based loss functions. We propose a ``surgery'' on the gradient vector field as a simple yet effective intervention to mitigate calibration issues. This surgery adds a factor to the loss's partial derivative, scaling the gradient's magnitude linearly with the prediction error. In empirical evaluations across 2D and 3D medical segmentation tasks, we demonstrate that this intervention, used in conjunction with any region-based loss function, improves calibration while maintaining high prediction accuracy.
\end{abstract}

\begin{keywords}
Segmentation, Calibration, Optimization, Gradient Surgery, Metastases
\end{keywords}

\section{Introduction}
The Dice Similarity Coefficient (DSC) has become a primary evaluation metric and loss function in medical image segmentation.
Originally adapted for volumetric segmentation \cite{milletari2016v}, the Dice loss and its derivatives (e.g., \cite{salehi2017tversky, taghanaki2019combo}) excel in scenarios of extreme class imbalance, a ubiquitous challenge in medical imaging where foreground structures (e.g., lesions or vessel fragments) occupy negligible fractions of the image volume. By directly optimizing a continuous approximation of the region overlap between predictions and ground truth, the Dice loss circumvents the local minima often encountered when training voxel-wise objectives on highly imbalanced data.

However, this robustness to class imbalance comes at a cost.  
The Dice loss cannot inherently enforce probabilistic consistency with the underlying data-generation process, unlike e.g. the Cross-Entropy (CE) loss, which corresponds directly to a proper scoring rule \cite{gneiting2007strictly}.
Instead, models trained with Dice loss exhibit pathological overconfidence, pushing softmax probabilities toward $0$ or $1$ regardless of the actual epistemic uncertainty. This creates a significant dichotomy for clinical model development. 
In high-stakes workflows, such as defining tumor resection margins or radiotherapy target volumes, a segmentation map is not merely a binary mask but a decision boundary. Well-calibrated predictions enable meaningful verification and the imperative possibility of adapting outputs to high-recall or high-sensitivity solutions \cite{sander2019towards, jiang2012calibrating}.


In this work, we analyze the partial derivatives w.r.t. the logits, which determine the gradient on the network's weights, to identify the root cause of miscalibration inherent to all region-based segmentation losses. We show that the gradient dynamics of these losses effectively neglect the calibration of the predicted probabilities and only optimize for region overlap between predictions and ground truth. To mitigate this issue, we propose \textit{gradient surgery}, a simple yet effective intervention on the gradient vector field of the model's voxel-wise logit outputs. Given a network's predicted probability $p$, this intervention rescales the loss's partial derivative w.r.t. single-voxel logits such that the error $|y-p|$ has a linear influence. In extensive empirical experiments on 2D and 3D medical segmentation tasks, we show that our proposed method improves model calibration while maintaining high segmentation performance.

\section{Related work}
Seminal works by \citet{mehrtash2020confidence, bertels2019optimization, sander2019towards} demonstrate that segmentation models trained with Dice loss provide miscalibrated, overconfident predictions and thus questioned their clinical applicability. Initial mitigation strategies included model ensembles to improve the calibration of such region-based losses \cite{mehrtash2020confidence}. Other common strategies involve compound objectives, such as the Combo Loss \cite{taghanaki2019combo} or Unified Focal Loss \cite{yeung2022unified}, which compute a weighted sum of Dice and CE (or Focal) terms. While these stabilize training, they often require extensive tuning of the weighting hyperparameter $\lambda$ and represent a compromise rather than a theoretical fix for the miscalibration. The marginal L1 average calibration error (mL1-ACE) was recently proposed as an auxiliary loss that is specifically targeted at improving voxel-wise calibration \cite{barfoot2024average}. Another focus was the direct adaptation of region-based losses. The Tversky Loss \cite{salehi2017tversky} generalizes the Dice coefficient to allow for individual weighting of false positives and false negatives, which impacts precision and recall but does not explicitly address probabilistic calibration. More recently, DSC++ \cite{yeung2023calibrating} introduced an exponent $\gamma > 1$ to the Dice formulation to selectively penalize overconfident, incorrect predictions. While the focal $\gamma$ results in improved calibration, it can drastically change the gradient dynamics compared to the Dice loss through the down-weighting of samples with large fractions of false positives and false negatives (see Appendix \ref{sec:app_dice++}). 
An alternative to modifying the primary loss is post-hoc recalibration \cite{rousseau2021post}. Techniques such as temperature scaling \cite{guo2017calibration}, Platt scaling \cite{platt1999probabilistic}, and isotonic regression \cite{zadrozny2002transforming} map model outputs to calibrated probabilities after training. While effective on in-distribution data, these methods do not improve the quality of the learned feature representation and are known to degrade under the domain shifts common in medical deployment. 

Other works \cite{islam2021spatially,murugesan2025neighbor,murugesan2023trust,karani2023boundary} have focused on solutions for the uncertainty at lesion boundaries that is inherent to the data annotation process. Spatially varying label smoothing (SVLS) \cite{islam2021spatially} draws inspiration from label smoothing, specifically smoothing the voxels with varying neighbor annotations (i.e., boundary voxels). This method improves calibration for brain tumor, kidney tumor, lung nodule, and prostate zone segmentation. Neighbor-Aware Calibration (NACL) \cite{murugesan2025neighbor} reformulates and extends SVLS by treating it as a neighborhood-aware penalty. Moreover, it applies a constraint directly on the logits \cite{liu2022devil}, effectively reducing their magnitude. The penalty formulation allows flexible weighting of the initial optimization objective with the neighborhood-aware logit distance constraint.
Finally, boundary-weighted consistency regularization (BWCR) \cite{karani2023boundary} enforces logit consistency across corresponding pixels from different augmented versions of the same input. However, none of these methods is specifically designed to address calibration issues in models trained with region-based losses. In contrast, our work specifically targets the overconfidence problem in region-based losses, aiming to improve calibration when region-based losses are preferred over the standard cross-entropy loss.


\begin{figure}[ht!]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/fig1_gradient_dynamics.pdf}
    \caption{Partial derivatives of the Dice loss (top row) vs. the cross-entropy loss (bottom row) for single voxels. Sub-panels show the absolute value of $\frac{\partial L}{\partial p_i}$, $\frac{\partial p_i}{\partial z_i}$, and $\frac{\partial L}{\partial z_i}$ as a function of the predicted probability $p$ for a foreground ($y=1$, yellow) and a background ($y=0$, purple) voxel. Red squares indicate intersection points where the magnitude of foreground and background derivatives is in equilibrium. For cross-entropy, the curves intersect at $p=0.5$, encouraging uncertain predictions for indistinguishable voxel representations with different labels. For the Dice loss, they intersect at $p \in \{0,1\}$, effectively pushing \textit{all} predicted probabilities to extreme values.}
    \label{fig:partials_logits}
\end{figure}

\section{Gradient dynamics of region-based segmentation losses: analysis and intervention}
Below, we present a concise analysis of the Dice loss's gradient dynamics alongside our proposed intervention that encourages calibrated predictions. We assume a binary segmentation problem using a final sigmoid activation function to turn logits into probability values.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.8\linewidth]{figures/fig2_grad_logit.pdf}
    \caption{Visualization of the gradient w.r.t. the voxel-wise logits. Purple circles indicate confident errors, where gradients vanish through the activation function for Dice.}
    \label{fig:grads_logits}
\end{figure}

\subsection{Region-based losses converge to miscalibrated solutions}
\label{sec:method_miscalibration}
The soft Dice loss for a prediction/target mask pair $P\in [0,1]^{N}$ and $Y\in \{0,1\}^{N}$ is defined as:

\begin{equation}
    L_{DSC} (P, Y) = 1-\frac{2 \sum_{i=1}^{N} p_i y_i + \varepsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} y_i + \varepsilon}= 1-\frac{2I + \varepsilon}{P + Y + \varepsilon},
\end{equation}

where $I=\sum_{i=1}^{N} p_i y_i$ is the soft intersection between $P$ and $Y$ and, with a slight abuse of notation, $P$ and $Y$ on the right-hand side denote the sums $\sum_{i=1}^{N} p_i$ and $\sum_{i=1}^{N} y_i$. The partial derivative of the Dice loss w.r.t. the predicted probability $p_i$ for an individual input voxel $i$ is:

\begin{equation}
    \frac{\partial L_{DSC}}{\partial p_{i}}
= \frac{2y_{i}(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}},
\end{equation}

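which we will refer to as the \say{global} term $G(i)$. Note that this expression carries the sign of the update (descent) direction, i.e., it is the derivative of the overlap term $1-L_{DSC}$; we keep this sign convention for all partial derivatives below. Crucially, as with all region-based losses, $G(i)$ can usually be approximated as a constant $G$ for a given label $y_i$, because a single voxel has negligible influence on the global sums as the image size $N$ increases. Using sigmoid/softmax activation on the model's output, the partial derivative of the loss w.r.t. a single voxel logit $z_i$ is: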

\begin{equation}
    \frac{\partial L_{DSC}}{\partial z_{i}} = \frac{\partial L_{DSC}}{\partial p_{i}} \cdot \frac{\partial p_{i}}{\partial z_{i}}
= G(i) \cdot p_{i}(1 - p_{i}).
\label{eq:pd_z}
\end{equation}




These partial derivatives have two undesirable properties, visualized in Figure \ref{fig:partials_logits}.
First, $\nicefrac{\partial L_{DSC}}{\partial z_{i}}$ is maximized by uncertain voxels ($p\approx0.5$) while confident predictions ($p\approx0$ or $p\approx1$) have negligible influence on the gradient w.r.t. the network's parameters. Importantly, this occurs independently of their correctness; i.e., confident but incorrect predictions do not contribute to the gradient w.r.t. the network weights, as shown in Figure \ref{fig:grads_logits}.
Second, the partial derivatives for foreground and background intersect only at the function's boundaries, i.e., $0$ and $1$ (red squares in Figure \ref{fig:partials_logits}, top row), which causes the network to converge to overconfident predictions. 
%Intuitive explanation for a single pair of voxels
%Intuitively, this effect can be explained when 
We make a simplified argument by considering a scenario where a network is trained to a point where it has exhausted its discriminative capacity, i.e., there exist voxels $a$ and $b$ with different labels ($y_a=1$ and $y_b=0$) that are indistinguishable through the network's latent representation $\textbf{l}(x)$, i.e., $\textbf{l}(a) \approx \textbf{l}(b)$. Therefore, the network is forced to output highly coupled probabilities for both voxels, i.e., $p_a\approx p_b$. This scenario naturally arises when a network is trained towards convergence without reaching zero loss. Here, $a$ and $b$ influence the gradient w.r.t. the network parameters in opposite directions because of the opposing ground truth labels of these ``indistinguishable'' voxels. For $y_a=1$, the network parameters are guided towards a higher probability, and for $y_b=0$ towards a lower probability. Therefore, the network's weights converge to output the probability for $\textbf{l}(a)$ and $\textbf{l}(b)$ such that the influence of $a$ and $b$ on the gradient w.r.t. the network weights is in an equilibrium (red squares in Figure \ref{fig:partials_logits}):

\begin{equation}
\label{eq:equilibrium}
    \frac{\partial L}{\partial z_a} = -\frac{\partial L}{\partial z_b}
\end{equation}

% extending to sets of voxels
Formally, this equilibrium can be extended from a single pair of indistinguishable voxels to sets of indistinguishable voxels $S_k:=\{v \mid \textbf{l}(v)\approx c_k\}$, where $c_k$ is the voxel set's shared latent representation. As described above, the network predicts the same $p_k$ for all elements (voxels) in a set $S_k$. In addition to $p_k$, a set $S_k$ is characterized by its fraction of foreground labels $r_k=\nicefrac{|S^{fg}_k|}{|S_k|}$, where $S^{fg}_k:=\{v \in S_k \mid y(v)=1\} \subseteq S_k$ is the subset of $S_k$ containing the voxels with ground truth label $y=1$. These sets are naturally encountered when training on complete samples/batches consisting of a large number of voxels, whose influences accumulate, resulting in numerous different equilibrium probabilities that characterize the network's calibration.


In the case of cross-entropy, the equilibrium for any set of indistinguishable voxels $S_k$ with foreground ratio $r_k$ is reached at $p_k=r_k=\nicefrac{|S^{fg}_k|}{|S_k|}$. Note that this directly corresponds to perfect calibration, where the predicted probabilities correspond to the underlying data-generating distribution \cite{guo2017calibration}.
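
As a short sanity check of this statement, the following sketch uses the standard identity that, for a sigmoid output, the cross-entropy logit derivative of a voxel $v$ is $p_v - y_v$ (this identity is not derived in the text above and is stated here as an assumption of the sketch). Summing over a set $S_k$ with shared prediction $p_k$ and setting the accumulated influence to zero gives
\begin{equation*}
\sum_{v \in S_k} \left(p_k - y_v\right) = |S_k|\, p_k - |S^{fg}_k| = 0
\quad\Longleftrightarrow\quad
p_k = \frac{|S^{fg}_k|}{|S_k|} = r_k .
\end{equation*}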

In the case of Dice, an equilibrium can only be reached for $p_a=p_b=0$ or $p_a=p_b=1$ (i.e., $p_k\in\{0,1\}$ for a set $S_k$), which is independent of the set's label ratio and leads to overconfident predictions that are unrelated to the underlying data-generating distribution. 
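
The same accumulation argument can be sketched for the Dice loss. We assume, as above, that the global term is approximately constant per label, and write $G^{fg} > 0$ and $-G^{bg} < 0$ for its value at foreground and background voxels (shorthand used only for this sketch). With the logit derivative from Equation \ref{eq:pd_z}, the accumulated influence of a set $S_k$ is
\begin{equation*}
\sum_{v \in S_k} G(v)\, p_k (1-p_k) \;\approx\; p_k (1-p_k)\, |S_k| \left[ r_k\, G^{fg} - (1-r_k)\, G^{bg} \right].
\end{equation*}
The bracket does not depend on $p_k$ under this approximation, so unless it happens to vanish, the expression is zero only for $p_k \in \{0,1\}$. The foreground ratio $r_k$ merely influences the sign of the bracket, i.e., which of the two extremes is approached, but not the equilibrium value itself.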



\subsection{Combining calibration and region size imbalance awareness using gradient surgery}

We hypothesize that ideally, the partial derivatives w.r.t. the voxel logits respect (1) error magnitude to obtain equilibria resulting in calibrated probability outputs (similar to CE, see Figure \ref{fig:partials_logits}), and (2) dynamic adaptation to drastic region size imbalance for overlap maximization. Suitable partial derivatives for foreground and background that fulfill these requirements are:

\begin{equation}
\frac{\partial L}{\partial z^{fg}_{i}} =  (1-p_i) \frac{2(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}} = (1-p_i)\, G^{fg}(i),
\end{equation}

\begin{equation}
\frac{\partial L}{\partial z^{bg}_{i}} =   -p_i\frac{2I + \varepsilon}{(P + Y + \varepsilon)^{2}} = -p_i\,  G^{bg}(i),
\end{equation}

respectively. Here, the magnitude scales linearly with the error while maintaining adaptive, region-size-dependent foreground and background weighting. Notably, the global terms $G^{fg}$ and $G^{bg}$ are identical to those of the Dice formulation. Following the chain rule, where the partial derivative of the probability w.r.t. the logit is $\nicefrac{\partial p_i}{\partial z_i} = p_i(1-p_i)$, we would need a scalar loss that results in the following partial derivatives w.r.t. the single-voxel probabilities:


\begin{equation}
\frac{\partial L}{\partial p^{fg}_i} =  \frac{1}{p_i} \frac{2(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}} = \frac{1}{p_i} G^{fg}(i),
\end{equation}

\begin{equation}
\frac{\partial L}{\partial p^{bg}_{i}} =   -\frac{1}{1-p_i} \frac{2I + \varepsilon}{(P + Y + \varepsilon)^{2}} = -\frac{1}{1-p_i} G^{bg}(i).
\end{equation}

For these partial derivatives to form the gradient w.r.t. the logits $\nabla_\textbf{z}L$ for a scalar loss function $L$, we require symmetry of second derivatives, which is not guaranteed for all $z_i$ and $z_k$ as outlined in the proof in Appendix \ref{sec:app_proof}. We identify this as the reason no loss with the desired partials was previously proposed. Instead of relying on a scalar loss, we define a vector field $\mathcal{F}(\textbf{z})$ with 

\begin{equation}
\label{eq:field}
\mathcal{F}_i(\textbf{z}) =  \frac{p_i(1-p_i)}{(y_i p_i+(1-y_i)(1-p_i))} \frac{2y_i(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}},
\end{equation}


that we use as an optimization objective for model training. Our gradient scale factor is $p_i$ when $y_i = 0$ and $1 - p_i$ when $y_i = 1$. This can be interpreted either as a ``region-imbalance''-weighted cross-entropy gradient or as a linearly error-weighted Dice gradient without the sigmoid derivative. Implementation details of the method are described in Appendix \ref{sec:app_implementation}. 
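
To make the case distinction explicit, substituting $y_i \in \{0,1\}$ into the field above gives
\begin{equation*}
\mathcal{F}_i(\textbf{z}) =
\begin{cases}
(1-p_i)\, G^{fg}(i), & y_i = 1,\\[2pt]
-\,p_i\, G^{bg}(i), & y_i = 0.
\end{cases}
\end{equation*}
As a rough sketch of the resulting equilibrium, we reuse the simplifying assumptions of Section \ref{sec:method_miscalibration}: a set $S_k$ of indistinguishable voxels with shared prediction $p_k$ and foreground ratio $r_k$, and global terms that are approximately constant within the set. These assumptions are ours and only serve to illustrate the mechanism. Setting the accumulated field to zero yields
\begin{equation*}
\sum_{v \in S_k} \mathcal{F}_v \;\approx\; |S_k| \left[ r_k (1-p_k)\, G^{fg} - (1-r_k)\, p_k\, G^{bg} \right] = 0
\quad\Longleftrightarrow\quad
p_k = \frac{r_k\, G^{fg}}{r_k\, G^{fg} + (1-r_k)\, G^{bg}} .
\end{equation*}
Under these assumptions, the equilibrium lies strictly inside $(0,1)$ for $0 < r_k < 1$, increases monotonically with $r_k$, and coincides with the cross-entropy equilibrium $p_k = r_k$ when $G^{fg} = G^{bg}$. This differs from the Dice loss, whose equilibria sit at $p_k \in \{0,1\}$ regardless of $r_k$.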

Empirically, we find that adding a relatively sharp decline near $0$ and $1$ results in higher performance. We add this decline by multiplying the field by $(1-(1-p_i)^n)$ and $(1-p_i^n)$, where the exponent $n$ regulates the steepness of the decline. For $n\rightarrow\infty$, the resulting scale factor is essentially equivalent to $|y_i-p_i|$ for $0 < p_i < 1$, and $0$ for $p_i\in\{0,1\}$. Effectively, for $|y-p|$ values close to $0$, this resembles label smoothing for the cross-entropy loss \cite{szegedy2016rethinking,muller2019does}, as it reduces the incentive of the model to push probabilities to maximal certainty. Symmetrically, for $|y-p|$ close to $1$, this can be interpreted as de-emphasizing extremely confident errors, potentially improving robustness against obvious cases of label noise. An ablation on the exponent $n$ is presented in Section \ref{sec:abl_decline}. Including the decline terms, the vector field is defined as:

\begin{equation}
\label{eq:field_decline}
\mathcal{F}_i(\textbf{z}) =  (1-(1-p_i)^n)(1-p_i^n) \frac{p_i(1-p_i)}{(y_i p_i+(1-y_i)(1-p_i))} \frac{2y_i(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}}.
\end{equation}
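
To give a feeling for the decline terms, we write $d_n(p) := (1-(1-p)^n)(1-p^n)$ (a shorthand introduced only for this illustration) and evaluate it for $n=20$; the values below are rounded and purely illustrative:
\begin{equation*}
d_{20}(0.5) \approx 1.000, \qquad
d_{20}(0.9) \approx 0.878, \qquad
d_{20}(0.99) \approx 0.182, \qquad
d_{20}(0.999) \approx 0.020 .
\end{equation*}
Since $d_n(p) = d_n(1-p)$, the same values hold at $p = 0.1$, $0.01$, and $0.001$. The factor therefore leaves the field essentially unchanged over most of the probability range and only suppresses it close to $0$ and $1$. For $n=1$, it reduces to $d_1(p) = p(1-p)$, i.e., the sigmoid derivative.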


\subsection{Vector field stability}
The non-existence of a scalar loss function yielding the desired partial derivatives
%gradient composed of the desired partial derivatives from a scalar loss function 
implies that the logit vector field and, therefore, also the induced vector field on the network weights, is non-conservative. A non-zero curl of a vector field can cause well-studied difficulties during optimization, e.g., in the field of generative adversarial networks \cite{mescheder2017numerics}. However, the curl in our proposed vector field $\mathcal{F}(\textbf{z})$ is negligible compared to the diagonal terms, which prevents the problematic ``orbiting'' around solutions. A visualization of $\mathcal{F}(\textbf{z})$ compared to the gradient vector fields of other methods is displayed in Figure \ref{fig:vec_field} in Appendix \ref{sec:app_vec_field}. Moreover, $\mathcal{F}(\textbf{z})$ provides favorable theoretical properties that make it suitable for model training: First, the proposed vector field $\mathcal{F}(\textbf{z})$ is continuous and smooth on $\mathbb{R}^{N}$, since each component $\mathcal{F}_i(\textbf{z})$ is a smooth function, see Equation \ref{eq:field}. Second, the vector field always points towards the ground truth solution $\mathbf{g}$, since no sign flips occur for the components $\mathcal{F}_i(\textbf{z})$. These theoretical considerations, in conjunction with our empirical evaluation in Section \ref{sec:experimentation}, showcase the proposed solution's suitability for effective network training. Example training curves with different optimizers are displayed in Appendix \ref{sec:app_experimentation}.


\section{Experimentation and Results}
\label{sec:experimentation}
We compare our custom vector field adaptations for different region-based losses, including Dice, Tversky \cite{salehi2017tversky}, Combo loss (CE + Dice) \cite{taghanaki2019combo}, mL1-ACE (+Dice) loss, and Dice++ losses. Moreover, we include baselines employing spatially varying label smoothing (SVLS) \cite{islam2021spatially} and neighbor-aware calibration through penalty constraints (NACL) \cite{murugesan2025neighbor}. Notably, these were not designed to work in conjunction with region-based losses \cite{murugesan2023calibrating}. Details on the hyperparameter settings for these losses are listed in Appendix \ref{sec:app_experimentation}. We conduct a random hyperparameter search to find the optimal configuration for each setup with 25 and 10 runs for our 2D and 3D datasets, respectively. Implementation details for our optimization method are provided in Appendix \ref{sec:app_implementation}. We use the UNet architecture \cite{ronneberger2015u} with residual units \cite{he2016deep} combined with heavy domain-specific augmentations \cite{isensee2021nnu}. \\

\input{tables/tab1_INbreast_FIVES_stats}

\noindent \textbf{Metrics}\quad
We evaluate model calibration using negative log-likelihood (NLL), expected calibration error (ECE), maximum calibration error (MCE) \cite{naeini2015obtaining,guo2017calibration}, and Brier score \cite{glenn1950verification}. Calibration metrics are calculated on all voxels; a comparison to a calculation on the ``active'' foreground region, defined as the union of target and prediction foreground, is provided in Appendix \ref{sec:active}. Additionally, we report the Dice similarity coefficient (DSC) as an overlap-based metric. \\
\\ 

\noindent \textbf{Datasets} \quad
We perform experiments on 2D retinal vessel segmentation (FIVES dataset \cite{jin2022fives}), 2D mass segmentation in mammography images (INbreast dataset \cite{moreira2012inbreast}), 3D metastasis segmentation (BraTS-METS dataset \cite{maleki2025analysis}), and 3D tumor segmentation (KiTS dataset \cite{heller2019kits19}). Detailed descriptions of the datasets and data splits are provided in Appendix \ref{sec:app_experimentation}.

\input{tables/tab2_results_3d}

\subsection{Results}
Tables \ref{tab:2d_results} and \ref{tab:3d_results} present our main results for the experiments on 2D and 3D datasets. Our proposed gradient vector field surgery, applied to the gradient of a region-based loss function, improves calibration metrics compared to the respective baseline losses alone (ComboLoss, Dice, and Tversky) across all cases in all datasets. In some cases, our approach reduces NLL and ECE by factors of 4 to 6, with negligible (FIVES, BraTS, KiTS) or positive (INbreast) impact on binary prediction performance. Furthermore, our approach applied to varying baseline losses consistently yields the best (INbreast, KiTS, BraTS) or second-best (FIVES, BraTS) calibration scores in all metrics. In particular, CE + Dice with gradient vector field surgery performs strongly on all datasets in terms of calibration and DSC.


\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\linewidth,trim={0cm 1.0cm 0 0cm},clip]{figures/fig3_brain_maps3.pdf}
    \caption{Visualization of the predicted probability maps as heat maps on the BraTS dataset (best viewed zoomed in). Yellow and red indicate medium and high foreground probability, respectively. Blue arrows indicate regions of overconfidence of the Dice model, while our approach exhibits well-calibrated predictions. The Dice model overconfidently predicts background (A, both arrows and B, right arrow) and foreground (B, left arrow).}
    \label{fig:prob_map}
\end{figure}

On the 2D datasets, the proposed surgery yields substantial calibration gains. For INbreast, where standard losses often exhibit high instability due to the dataset's inherent challenges (foreground size variability, low foreground-to-background differences, limited sample size), our method improves training, recovering DSC performance up to $74\%$. We furthermore observe strong calibration performance of the CE loss, stemming from its optimal calibration properties as described in Section \ref{sec:method_miscalibration}. However, this comes at the cost of low predictive performance, especially when facing highly imbalanced datasets, such as INbreast (DSC of $66\%$). On FIVES, where DSC scores are generally high ($88.03\%$), our method maintains segmentation accuracy while drastically reducing ECE and MCE. 

In the 3D domain, which presents challenges related to volumetric imbalance and label noise, our approach consistently yields better-calibrated models without compromising segmentation accuracy. While all models trained with losses containing Dice components achieve comparable DSC on BraTS and KiTS, our gradient surgery substantially reduces the MCE and NLL compared to the baseline losses. Dice++ is notably strong on our 3D datasets; however, it is overall still inferior to the proposed gradient surgery, particularly regarding NLL and ECE. Similarly to our 2D experiments, CE-based loss functions (CE, SVLS, NACL) yield well-calibrated models that show weaker DSC because of the datasets' high region imbalance. Combining these losses with a Dice component drastically improves DSC scores, while having an adverse effect on calibration. Notably, NACL shows poor calibration performance when evaluated on all voxels because it is underconfident in background regions. When evaluating on active regions alone (Appendix \ref{sec:active}), NACL yields calibration comparable to our method. 


\subsection{Ablation on exponential decline factor}
\label{sec:abl_decline}
To investigate the impact of the exponent $n$ in the multiplicative decline terms $(1-(1-p)^n)$ and $(1-p^n)$, we perform an ablation study on the FIVES dataset with $n\in \{1, 2, 5, 20, 40, 60, 80, 100, 200, 1000\}$ and all other hyperparameters fixed. The experiment shows that calibration and region overlap performance improve until $n=20$ (see Table \ref{tab:ablation_exponential}). For larger $n$, only minor differences in all metrics are observable, indicating robust performance across values of $n$ above this threshold.

\input{tables/tab3_ablation_exponential}


\section{Conclusion}
In this work, we theoretically analyze the partial derivatives of widespread region-based loss functions and show their formal connection to network calibration. We identify how the Dice/Tversky loss is incentivized to produce overconfident predictions and propose gradient surgery as a simple solution. This ``surgery'' combines the benefits of gradients that scale with error magnitude with robustness to region imbalance. Instead of relying on a scalar loss, we directly define vector fields at the level of the logits as loss surrogates and prove that they cannot be formalized as scalar loss functions. While this comes at the expense of desirable theoretical guarantees due to the non-conservative nature of the vector fields, we theoretically and empirically demonstrate that our defined vector fields possess favorable properties for model training. Our method drastically improves calibration metrics across diverse medical segmentation datasets in 2D and 3D, including metastasis segmentation, where calibrated outputs provide valuable insights into borders and potential emergence of metastasis. Future work should focus on two directions: first, deriving theoretical bounds for the stability of such non-conservative gradient fields; second, exploring the utility of better-calibrated networks in clinical practice. Ultimately, this approach provides a generalizable mechanism for training uncertainty-aware segmentation networks, a prerequisite for trustworthy clinical decision support.

% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{This work was partially supported by the German Federal Ministry of Research, Technology, and Space (BMFTR) as part of the Software Campus 3.0 (TU München) under grant number 01IS23069.}

\bibliography{midl26_317}


\newpage
\appendix

\section{Proof of the non-existence of a corresponding scalar loss function}
\label{sec:app_proof}

A corresponding scalar function only exists for conservative vector fields. We show that the vector field $\mathcal{F}(\textbf{z})$ is non-conservative and therefore no corresponding scalar loss function exists. For $\mathcal{F}(\textbf{z})$ to be conservative, there must exist a potential function $L$ (the desired loss function) such that:

$$\frac{\partial^2 L}{\partial z_i \partial z_k} = \frac{\partial^2 L}{\partial z_k \partial z_i}$$

\noindent With the desired partial derivatives, we have 
$$\frac{\partial L}{\partial z_i}  =  \frac{p_i(1-p_i)}{(y_i p_i+(1-y_i)(1-p_i))} \frac{2y_i(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}}, $$

\noindent and 

$$\frac{\partial L}{\partial z_k}  =  \frac{p_k(1-p_k)}{(y_k p_k+(1-y_k) (1-p_k))} \frac{2y_k(P + Y + \varepsilon) - (2I + \varepsilon)}{(P + Y + \varepsilon)^{2}}, $$

\noindent with the second-order partial derivatives:

$$\frac{\partial^2 L}{\partial z_k \partial z_i} =  \frac{p_i(1-p_i)}{(y_i p_i+(1-y_i)(1-p_i))} \cdot \left[\frac{-2y_i}{(P + Y + \varepsilon)^2} - \frac{2y_k(P + Y + \varepsilon) - 2(2I + \varepsilon)}{(P + Y + \varepsilon)^3}\right] \cdot p_k(1 - p_k)$$

$$\frac{\partial^2 L}{\partial z_i \partial z_k} =  \frac{p_k(1-p_k)}{(y_k p_k+(1-y_k)(1-p_k))} \cdot \left[\frac{-2y_k}{(P + Y + \varepsilon)^2} - \frac{2y_i(P + Y + \varepsilon) - 2(2I + \varepsilon)}{(P + Y + \varepsilon)^3}\right] \cdot p_i(1 - p_i)$$

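\noindent Taking, e.g., $y_i = y_k = 0$, the local factors reduce to $p_i$ and $p_k$, and we obtain

$$\frac{\partial^2 L}{\partial z_k \partial z_i} =  p_i  \frac{2(2I + \varepsilon)}{(P + Y + \varepsilon)^3} \cdot p_k(1 - p_k)$$

$$\frac{\partial^2 L}{\partial z_i \partial z_k} =  p_k  \frac{2(2I + \varepsilon)}{(P + Y + \varepsilon)^3} \cdot p_i(1 - p_i)$$

\noindent Since $p_i, p_k \in (0,1)$ and $2I + \varepsilon > 0$ for $\varepsilon > 0$, the equality

$$\frac{\partial^2 L}{\partial z_k \partial z_i}  = \frac{\partial^2 L}{\partial z_i \partial z_k} $$

\noindent holds only if

$$1-p_k = 1-p_i.$$

\noindent This requires $p_k = p_i$, which does not hold for arbitrary $p_k$ and $p_i$. Hence, the mixed partial derivatives are not symmetric in general, $\mathcal{F}(\textbf{z})$ is non-conservative, and no corresponding scalar loss function exists.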
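
\noindent The asymmetry can also be checked numerically. The following minimal sketch (plain Python, not the implementation used in our experiments) evaluates the vector field defined above on a toy example with three voxels and estimates the two mixed partial derivatives with central finite differences. The toy logits and labels, the value $\varepsilon = 1$, and all function names are illustrative assumptions.

\begin{lstlisting}[language=Python, caption={Illustrative numerical check of the asymmetric mixed partial derivatives (toy values are assumptions)}, label={lst:sketch_nonconservative}]
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def vector_field(z, y, eps=1.0):
    """Evaluate the desired partial derivatives dL/dz_i for all voxels."""
    p = [sigmoid(v) for v in z]
    inter = sum(pi * yi for pi, yi in zip(p, y))   # I
    denom = sum(p) + sum(y) + eps                  # P + Y + eps
    field = []
    for pi, yi in zip(p, y):
        local = pi * (1 - pi) / (yi * pi + (1 - yi) * (1 - pi))
        glob = (2 * yi * denom - (2 * inter + eps)) / denom ** 2
        field.append(local * glob)
    return field


def jacobian_entry(z, y, i, k, h=1e-5):
    """Central finite difference of component i w.r.t. logit k."""
    z_up = list(z)
    z_up[k] += h
    z_down = list(z)
    z_down[k] -= h
    return (vector_field(z_up, y)[i] - vector_field(z_down, y)[i]) / (2 * h)


# Toy example: two background voxels (indices 0, 1), one foreground voxel
z = [-1.0, 0.5, 2.0]
y = [0, 0, 1]

J01 = jacobian_entry(z, y, 0, 1)  # derivative of component 0 w.r.t. z_1
J10 = jacobian_entry(z, y, 1, 0)  # derivative of component 1 w.r.t. z_0
p = [sigmoid(v) for v in z]

print("J01 =", J01, "J10 =", J10)
print("ratio J01/J10 =", J01 / J10, "vs (1-p_1)/(1-p_0) =", (1 - p[1]) / (1 - p[0]))
\end{lstlisting}

\noindent For two background voxels, the ratio of the two estimates equals $(1-p_k)/(1-p_i)$, so the Jacobian is symmetric only for $p_i = p_k$, in line with the derivation above.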

\newpage
\section{Vector/gradient fields of different loss functions}
\label{sec:app_vec_field}

Figure \ref{fig:vec_field} visualizes the gradient field w.r.t. the logits $\textbf{z}$ for different loss functions, in addition to our proposed vector field. For better visualization, the fields are depicted in probability space for two variables $p_1$ and $p_2$, although the vectors represent the partial derivatives w.r.t. the logits. For losses influenced by global statistics (all but CE), we assume an imbalanced example with 2 foreground voxels ($y = 1$) and 98 background voxels ($y = 0$), with assumed probability values of 0.8 for foreground voxels and 0.1 for background voxels. The displayed gradient/vector fields add 2 additional voxels $y_1=1$ and $y_2=2$, and show the gradient on their logits for different $p_1$ and $p_2$ values, while the probabilities/logits for the other 100 voxels stay unchanged.

\begin{figure}
    \centering
    \includegraphics[width=0.75\linewidth]{figures/app1_vector_fields.pdf}
    \caption{Gradient/vector fields w.r.t. logits of different loss functions and our proposed vector field. For better visualization, the axes are displayed as probabilities $p$ instead of logits $z$.}
    \label{fig:vec_field}
\end{figure}

\FloatBarrier
\section{Experimentation details}
\label{sec:app_experimentation}




\subsubsection*{Datasets}
The INbreast dataset \cite{moreira2012inbreast} contains 107 images with masses. We separate 22 ($\sim20\%$ of total) images for the test set. Validation metrics for model selection are calculated on 17 ($=20\%$) of the remaining images. We resize the images to a resolution of $512\times512$ for model training.

\noindent The FIVES dataset \cite{jin2022fives} contains 800 fundus images with vessel annotations. We rescale images to a resolution of $1024\times1024$, and train and evaluate on the central patch of $512\times512$ pixels. We separate 200 images ($=20\%$ of total) for testing. Of the remaining data, we use 120 images ($=20\%$ of remaining) as the validation set.

The original BraTS Metastasis dataset \cite{maleki2025analysis} comprises a retrospective collection of 1296 pre- and post-treatment brain metastases, labeled in four classes: nonenhancing tumor core, FLAIR hyperintensity, enhancing tumor, and resection cavity. Each sample has four input channels (T1, T1c, T2, FLAIR). We use a random, representative subset of 156 cases for training, 44 for validation, and 251 for testing. As input, we use only the T1c scan, disregarding the others. The images vary in size, orientation, and spacing. We preprocess each image in an nnUNet-style fashion \cite{isensee2021nnu} with reorientation to RAS+, resampling to an isotropic spacing of $1\,\mathrm{mm}$, and z-score normalization. The resulting volumes have a median shape of [$141\times175\times142$]. We convert the labels to a binary format, where enhancing and non-enhancing tumor tissue are foreground and FLAIR hyperintensity, resection cavity, and healthy tissue are background. The lesions account for $0.17\%$ of the total voxels with a standard deviation of $0.29\%$ per sample. During training, we extract a random patch of size [$80\times96\times80$] with a foreground oversample ratio of $0.33$.

The KiTS dataset \cite{heller2019kits19} comprises 489 abdominal CT scans where kidneys, renal tumors, and renal cysts are labeled. The images vary in size, orientation, and spacing. We preprocess each image in an nnUNet-style fashion \cite{isensee2021nnu} with reorientation to RAS+, resampling to an isotropic spacing of $1.5\,\mathrm{mm}$, and intensity-clipping to the [$0.5,99.5$] percentiles (i.e., [$-58,302$] HU) followed by z-score normalization. We extract ROIs of varying sizes with a median of [$218\times130\times160$] around both kidneys. We convert the labels to a binary format, where the tumors are foreground and the kidneys, cysts, and the rest are background. The tumors account for $0.68\%$ of the total voxels with a standard deviation of $1.18\%$ per sample. We stratify the complete dataset into 192 training, 50 validation, and 247 test cases, which are balanced in terms of size and number of tumors. During training, we extract 8 random patches per sample, with a foreground oversample ratio of 0.45 and a fixed size of [$96\times96\times80$] for each patch.
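
\noindent The intensity clipping and z-score normalization described for the CT data can be sketched as follows. This is a minimal plain-Python illustration, not the preprocessing code used in our pipeline. The function name, the toy intensity values, and the use of per-volume statistics with the population standard deviation are assumptions; only the clipping bounds of $[-58, 302]$ HU are taken from the text.

\begin{lstlisting}[language=Python, caption={Illustrative sketch of intensity clipping followed by z-score normalization (per-volume statistics are an assumption)}, label={lst:sketch_clip_zscore}]
import statistics


def clip_and_zscore(intensities, lower=-58.0, upper=302.0):
    """Clip intensities to [lower, upper], then standardize them."""
    clipped = [min(max(v, lower), upper) for v in intensities]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)  # population std (assumption)
    if std == 0.0:
        # Constant input: avoid division by zero
        return [0.0 for _ in clipped]
    return [(v - mean) / std for v in clipped]


# Toy HU values, including air (-1000) and a very bright voxel (1500)
hu_values = [-1000.0, -58.0, 0.0, 100.0, 302.0, 1500.0]
normalized = clip_and_zscore(hu_values)
print(normalized)
\end{lstlisting}

\noindent After clipping, values outside the percentile range collapse onto the bounds, so extreme intensities no longer dominate the mean and standard deviation used for normalization.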

\subsubsection*{Loss hyperparameters}
For the Tversky loss, we treat $\alpha$ (with $\beta = 1-\alpha$) as a hyperparameter. For Combo loss \cite{taghanaki2019combo}, we fix the weighting to 0.5 \cite{isensee2021nnu}. For Dice++, we set the $\gamma$ parameter to $2$ \cite{yeung2023calibrating}. For SVLS, we set the kernel size to 3 and use $\sigma$ as a hyperparameter with possible settings of 1, 2, and 3 \cite{islam2021spatially}. For NACL, we use the penalty formulation, set the balancing parameter $\lambda$ to 0.1 and the kernel size to 3, use a mean prior ($\tau$), and use an L1 penalty \cite{murugesan2025neighbor}. Finally, for the mL1-ACE loss, we use equal weighting with the Dice loss and use 20 bins to discretize the probability space \cite{barfoot2024average}.

\subsubsection*{Model and Training Procedure}
Our 3D experiments utilize a full-resolution 3D UNet with residual units, following a pipeline heavily influenced by nnUNet \cite{isensee2021nnu}, particularly in terms of network size, learning rate schedule, iterations, augmentations, and optimizer. We use SGD with Nesterov momentum as the optimizer. In our hyperparameter optimization, we further tune weight decay, initial learning rate, and momentum. At test time, we perform sliding-window inference on the complete volume with an overlap ratio of 0.5 and Gaussian weighting.
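
\noindent To illustrate sliding-window inference with overlap and Gaussian weighting, the following minimal sketch operates on a 1D signal in plain Python. It is not the inference code used in our experiments. The function names, the toy signal, the window length, and the Gaussian width of one eighth of the window length are assumptions; only the overlap ratio of 0.5 is taken from the text.

\begin{lstlisting}[language=Python, caption={Illustrative 1D sketch of sliding-window inference with Gaussian weighting (window size and sigma are assumptions)}, label={lst:sketch_sliding_window}]
import math


def window_starts(length, window, overlap=0.5):
    """Start indices of windows that cover the whole signal."""
    step = max(1, int(window * (1 - overlap)))
    starts = list(range(0, max(length - window, 0) + 1, step))
    if starts[-1] + window < length:
        # Add a final window that is flush with the end of the signal
        starts.append(length - window)
    return starts


def gaussian_weights(window, sigma_scale=0.125):
    """Weights that favor the window center over its borders."""
    center = (window - 1) / 2
    sigma = sigma_scale * window
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(window)]


def sliding_window_average(signal, predict, window, overlap=0.5):
    """Aggregate overlapping window predictions by a weighted average."""
    weights = gaussian_weights(window)
    acc = [0.0] * len(signal)
    norm = [0.0] * len(signal)
    for start in window_starts(len(signal), window, overlap):
        patch_pred = predict(signal[start:start + window])
        for j, (w, v) in enumerate(zip(weights, patch_pred)):
            acc[start + j] += w * v
            norm[start + j] += w
    return [a / n for a, n in zip(acc, norm)]


# Toy usage: an identity "model" must be reproduced by the aggregation
signal = [float(i) for i in range(11)]
merged = sliding_window_average(signal, lambda patch: patch, window=4)
print(window_starts(len(signal), 4), merged)
\end{lstlisting}

\noindent The Gaussian weights reduce the influence of predictions near the window borders, where less spatial context is available.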

\subsubsection*{Training curves stability}
Figure \ref{fig:training_curves} shows training and validation curves with different optimizers. 

\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\linewidth]{figures/app2_training_curves.pdf}
    \caption{Training curves with SGD+Nesterov momentum and Adam optimizers for a model trained with the Dice loss and a model with the same hyperparameters trained with our Dice surgery training objective.}
    \label{fig:training_curves}
\end{figure}

\newpage
\section{Implementation details for custom vector fields}
\label{sec:app_implementation}

In practice, we implement our custom (``gradient'') vector field for the different region-based losses by overwriting the backward pass of the softmax activation. The forward pass through the softmax remains unchanged. We keep the partial derivatives of the loss w.r.t. the probabilities and exchange the partial derivative of the probabilities w.r.t. the logits to reflect our desired vector field, e.g.\@, by replacing $p(1-p)$ with $|y-p|$ and adding the sharp decline terms close to $p=0$ and $p=1$, $(1-(1-p)^n)$ and $(1-p^n)$. The actual implementation is shown in the following listing.

\begin{lstlisting}[language=Python, caption={Implementation of GradSurgeSoftmax}, label={lst:gradsurge}]
import torch
from torch.amp import custom_bwd, custom_fwd


class GradSurgeSoftmax(torch.autograd.Function):
    @staticmethod
    @custom_fwd(device_type="cuda", cast_inputs=torch.float32)
    def forward(ctx, logits, targets, exponential_correction=None):
        # The forward pass is a standard softmax over the class dimension
        probs = torch.softmax(logits, dim=1)

        # Absolute error |y - p| per class channel
        error = torch.abs(probs - targets)

        if exponential_correction is not None:
            # Applying the correction term (sharp decline close to error = 0 and error = 1)
            n = exponential_correction
            error_weight = 0.25 * error * (1 - torch.pow(error, n)) * (1 - torch.pow(1 - error, n))
        else:
            error_weight = 0.25 * error

        ctx.save_for_backward(error_weight)
        return probs

    @staticmethod
    @custom_bwd(device_type="cuda")
    def backward(ctx, grad_output):
        error_weight, = ctx.saved_tensors

        grad_output = grad_output.to(error_weight.dtype)

        # Assuming binary segmentation (background vs foreground)
        weight = error_weight[:, 1:2]
        grad_p_bg = grad_output[:, 0:1]
        grad_p_fg = grad_output[:, 1:2]

        # Same coupling as in the two-class softmax backward pass
        coupling = (grad_p_fg - grad_p_bg)

        grad_logits_bg = -weight * coupling
        grad_logits_fg = weight * coupling

        # One gradient per forward input: logits, targets, exponential_correction
        return torch.cat([grad_logits_bg, grad_logits_fg], dim=1), None, None
\end{lstlisting}
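
\noindent To illustrate the weighting computed in the forward pass of Listing \ref{lst:gradsurge}, the following minimal sketch compares it with the standard factor $p(1-p)$ for a single voxel in plain Python. The function names and the example probabilities are illustrative assumptions; the weight itself mirrors the expression in the listing.

\begin{lstlisting}[language=Python, caption={Illustrative comparison of the standard factor p(1-p) with the error-based weight (example values are assumptions)}, label={lst:sketch_error_weight}]
def sigmoid_factor(p):
    """Standard local derivative factor p(1-p)."""
    return p * (1.0 - p)


def surgery_weight(p, y, n=None):
    """Error-based weight 0.25*|y-p|, optionally with the correction term."""
    error = abs(p - y)
    weight = 0.25 * error
    if n is not None:
        weight *= (1.0 - error ** n) * (1.0 - (1.0 - error) ** n)
    return weight


# Background voxel (y = 0) predicted with increasing foreground probability
for p in (0.5, 0.9, 0.99, 0.999):
    print(p, sigmoid_factor(p), surgery_weight(p, 0), surgery_weight(p, 0, n=2))
\end{lstlisting}

\noindent Without the correction term, the weight grows linearly with the error $|y-p|$, whereas $p(1-p)$ shrinks for confidently wrong predictions. With the correction term, the weight additionally declines close to an error of 0 and 1.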



\newpage
\FloatBarrier
\section{Dice++ gradient}
\label{sec:app_dice++}

The Dice++ loss \cite{yeung2023calibrating}

$$L_{DSC++} = 1-\frac{2 \sum_{i=1}^{N} p_i y_i + \epsilon}{2\sum_{i=1}^{N} p_i y_i + \sum_{i=1}^{N} (p_i (1-y_i))^\gamma + \sum_{i=1}^{N} ((1-p_i) y_i)^\gamma + \epsilon}$$

\noindent was proposed to resolve the calibration issues of the Dice loss by introducing a focus parameter $\gamma$ on false positives and false negatives. Its partial derivatives w.r.t. the foreground and background probabilities are

$$\frac{\partial L_{DSC++}}{\partial p^{fg}_{i}} =  \frac{ -2[\gamma(1-p_i)^{\gamma-1}I +FP^{\gamma}+ FN^{\gamma}]}{(2I + 2p_i + (1-p_i)^{\gamma}+ FP_{-i}^{\gamma}+ FN_{-i}^{\gamma})^{2}} $$

$$\frac{\partial L_{DSC++}}{\partial p^{bg}_{i}} = \frac{\gamma p_i^{\gamma-1}2I}{(2I + p_i^{\gamma}+ FP_{-i}^{\gamma}+ FN_{-i}^{\gamma})^{2}} $$

\noindent For $\gamma = 2$ exactly, the partial derivative of the Dice++ loss w.r.t. the probabilities depends linearly on the error, as for the MSE loss on the probabilities (e.g., the local factor of $\nicefrac{\partial L_{DSC++}}{\partial p^{bg}_{i}}$ is $2p_i$). At the same time, the global term for $y=0$ and $y=1$ is most similar to that of the original Dice loss at $\gamma=2$ compared to higher values of $\gamma$. We identify this as the reason for the optimal performance in terms of Dice and calibration metrics at $\gamma= 2$.


\noindent Besides this desired property, the $\gamma$ parameter introduces a vast downscaling of the gradient for samples with large proportions of false positives and false negatives. This can be problematic (1) when the foreground regions have drastically different sizes and, consequently, drastically different values for false positives and false negatives, and (2) when learning from ``hard'' examples, characterized by high false positive and false negative rates, is crucial. On the other hand, previous works also showed that in some cases, a focus on easy examples can be beneficial for segmentation performance \cite{abraham2019novel}.
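
\noindent The linear dependence on the error at $\gamma = 2$ can be checked numerically. The following minimal sketch (plain Python, not part of our training code) differentiates the Dice++ loss w.r.t. the probability of one background voxel with central finite differences and divides out the global term. The toy probabilities and labels, the value of $\epsilon$, and all function names are illustrative assumptions.

\begin{lstlisting}[language=Python, caption={Illustrative finite-difference check of the Dice++ partial derivative for a background voxel (toy values are assumptions)}, label={lst:sketch_dicepp}]
def dice_pp_terms(p, y, gamma=2.0, eps=1e-6):
    """Numerator and denominator of the Dice++ fraction."""
    inter = sum(pi * yi for pi, yi in zip(p, y))
    fp = sum((pi * (1 - yi)) ** gamma for pi, yi in zip(p, y))
    fn = sum(((1 - pi) * yi) ** gamma for pi, yi in zip(p, y))
    numerator = 2 * inter + eps
    return numerator, 2 * inter + fp + fn + eps


def dice_pp_loss(p, y, gamma=2.0):
    numerator, denominator = dice_pp_terms(p, y, gamma)
    return 1.0 - numerator / denominator


def numeric_partial(p, y, i, gamma=2.0, h=1e-6):
    """Central finite difference of the loss w.r.t. p_i."""
    up = list(p)
    up[i] += h
    down = list(p)
    down[i] -= h
    return (dice_pp_loss(up, y, gamma) - dice_pp_loss(down, y, gamma)) / (2 * h)


# Toy example: two foreground voxels followed by two background voxels
p = [0.9, 0.6, 0.3, 0.1]
y = [1, 1, 0, 0]
i = 2  # background voxel with p_i = 0.3

numerator, denominator = dice_pp_terms(p, y)
# Divide out the global term to isolate the local, error-dependent factor
local_factor = numeric_partial(p, y, i) * denominator ** 2 / numerator
print(local_factor, 2 * p[i])
\end{lstlisting}

\noindent For $\gamma = 2$, the recovered local factor agrees with $2p_i$, i.e., it scales linearly with the error of the background voxel.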


\FloatBarrier
\section{Partial derivative function visualization}
\label{app:function_vis}
\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\linewidth]{figures/app3_partials_functions.pdf}
    \caption{Visualization of the normalized partial derivatives derived from different loss functions. }
    \label{fig:partial_function_plots}
\end{figure}




\FloatBarrier
\newpage
\section{Qualitative examples}
\label{sec:qual}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\linewidth]{figures/app4_fives_prob_maps_examples.pdf}
    \caption{Comparison of the probability maps for two samples from a Dice loss trained model and our adapted Dice vector field approach. The overconfidence phenomenon on the Dice probabilities is apparent, with probabilities almost appearing binarized. In comparison, our method has probabilities other than 0 and 1, which often reflect plausible vascular courses (e.g., purple ellipses) that are missing in the probability maps of the Dice loss-trained model.}
    \label{fig:qual_vessel}
\end{figure}

\FloatBarrier
\newpage
\section{Active region evaluation}
\label{sec:active}

Tables \ref{tab:combined_active_results} and \ref{tab:3d_active_region} show an evaluation of the calibration metrics on the ``active'' foreground region, defined as the union of the target and (binarized) prediction foreground regions \cite{murugesan2025neighbor}. Evaluating only on the active region changes the absolute values of the calibration metrics markedly. However, the trends in the improvement achieved through our method remain unchanged.
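
\noindent The restriction to the active region can be sketched as follows. This minimal plain-Python example is not the evaluation code behind the reported tables. The binarization threshold of 0.5, the simple equal-width binary ECE on the foreground probabilities, the number of bins, and the toy values are assumptions made for illustration.

\begin{lstlisting}[language=Python, caption={Illustrative sketch of an active-region evaluation (threshold, ECE variant, and toy values are assumptions)}, label={lst:sketch_active_region}]
def active_region(probs, targets, threshold=0.5):
    """Union of target foreground and binarized prediction foreground."""
    return [t == 1 or p >= threshold for p, t in zip(probs, targets)]


def expected_calibration_error(probs, targets, n_bins=10):
    """Binary ECE on foreground probabilities with equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, t in zip(probs, targets):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, t))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(t for _, t in members) / len(members)
        ece += len(members) / len(probs) * abs(confidence - frequency)
    return ece


# Toy example: six confident background voxels and four "active" voxels
probs = [0.02] * 6 + [0.62, 0.91, 0.43, 0.95]
targets = [0] * 6 + [0, 1, 1, 1]

mask = active_region(probs, targets)
active_probs = [p for p, m in zip(probs, mask) if m]
active_targets = [t for t, m in zip(targets, mask) if m]

ece_all = expected_calibration_error(probs, targets)
ece_active = expected_calibration_error(active_probs, active_targets)
print(sum(mask), ece_all, ece_active)
\end{lstlisting}

\noindent In this toy example, the easy background voxels dilute the calibration error computed on all voxels, so the value on the active region is larger.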

\input{tables/tab4_acive_region_combined}
\input{tables/tab5_kits_active_region}

\newpage
\FloatBarrier
\section{Effect on different foreground sizes}
Table \ref{tab:kits_lesion_metrics} summarizes the detection performance and ECE for different tumor sizes on the KiTS dataset. A tumor is considered detected if at least one voxel within its label area is correctly predicted as foreground. Very small foreground components (below $4\,\mathrm{cm}^3$) are not included in the comparison as they are assumed to constitute label noise \cite{berger2025pitfalls}. We observe consistent calibration improvements when applying our proposed gradient field surgery across all tumor sizes. Furthermore, our approach improves the detection of small tumors slightly more than it does for large and very large tumors. We hypothesize that improved calibration is most beneficial for small tumors, where model uncertainty is naturally higher; in these borderline cases, accurate probability estimates are critical for successful detection because they help push the predicted probabilities of these difficult cases close to the detection threshold, thereby improving detection.

ECE values are calculated on the lesion foreground voxels and averaged across lesions of each size category.
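
\noindent The detection criterion described above can be sketched as follows. This is a minimal plain-Python illustration in which lesions and predictions are represented as sets of voxel coordinates. The function names, the \texttt{min\_voxels} size filter expressed in voxels, and the toy coordinates are hypothetical and not taken from our evaluation code.

\begin{lstlisting}[language=Python, caption={Illustrative sketch of the lesion-wise detection criterion (names and toy data are hypothetical)}, label={lst:sketch_detection}]
def is_detected(lesion_voxels, predicted_foreground):
    """A lesion counts as detected if any of its voxels is predicted as foreground."""
    return not lesion_voxels.isdisjoint(predicted_foreground)


def detection_rate(lesions, predicted_foreground, min_voxels=1):
    """Fraction of detected lesions among those with at least min_voxels voxels."""
    kept = [lesion for lesion in lesions if len(lesion) >= min_voxels]
    if not kept:
        return None  # no lesion left after the size filter
    hits = sum(is_detected(lesion, predicted_foreground) for lesion in kept)
    return hits / len(kept)


# Toy example with three labeled lesions and a predicted foreground mask
lesions = [
    {(0, 0, 0), (0, 0, 1)},
    {(5, 5, 5)},
    {(9, 9, 9), (9, 9, 8), (9, 8, 8)},
]
predicted_foreground = {(0, 0, 1), (9, 9, 9), (3, 3, 3)}

rate_all = detection_rate(lesions, predicted_foreground)
rate_filtered = detection_rate(lesions, predicted_foreground, min_voxels=2)
print(rate_all, rate_filtered)
\end{lstlisting}

\noindent Excluding the smallest component changes the detection rate in this toy example, which illustrates why the size filter has to be fixed before methods are compared.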

\input{tables/tab6_kits_lesion_sizes}

\FloatBarrier
\section{Experiments with transformer architecture}

In addition to the main experiments (nnUNet-style training), we perform experiments with the state-of-the-art Primus transformer architecture \cite{wald2025primus} for image segmentation. The results in Table \ref{tab:kits_transformer} show that vector field surgery also works in combination with transformer architectures, yielding notably better calibration scores. However, the overall performance of the transformer approach was poor. In medical image segmentation, convolutional approaches have proven more effective in numerous extensive evaluation studies \cite{isensee2021nnu,isensee2024nnu,wald2025primus}.

\input{tables/tab7_kits_transformer}


\FloatBarrier
\newpage
\section{Effect on logit values}
\label{sec:logit_vals}
Table \ref{tab:combined_logit_results} shows the average logit values in target foreground and background regions. Our method reduces the logit distance and the absolute value of the logits. Especially for the FIVES dataset, standard region-based losses show very large logit values, indicating overconfidence. Reduced logit magnitudes were found to result in improved calibration scores in earlier works on label smoothing and logit constraints \cite{muller2019does,murugesan2025neighbor}. The analysis provides direct evidence that the vector field intervention is effective in resolving the overconfidence problem of region-based losses.
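
\noindent As a short worked example of why large logit distances indicate overconfidence, consider the binary softmax, for which the foreground probability depends only on the logit distance $\Delta = z^{fg} - z^{bg}$. The symbol $\Delta$ and the example values are introduced here for illustration only:

$$p^{fg} = \frac{e^{z^{fg}}}{e^{z^{fg}} + e^{z^{bg}}} = \frac{1}{1 + e^{-\Delta}}$$

\noindent A logit distance of $\Delta = 2$ corresponds to $p^{fg} \approx 0.881$, whereas $\Delta = 10$ already yields $p^{fg} \approx 0.99995$, i.e., a practically binarized prediction.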

\input{tables/tab8_logits_combined}

\end{document}
