





\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{graphicx}
\usepackage{diagbox}
\usepackage{array}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{titlesec}
\usepackage{booktabs}

%%%%%%%%% for tables %%%%%%%%%%
\usepackage{multicol}
\usepackage{adjustbox} % automatically adjust table width
\usepackage{colortbl}
\usepackage{hhline}
\usepackage{arydshln} % for dashed lines

\usepackage{enumitem}
\newcommand{\revision}[1]{\textcolor{blue}{#1}}

% \usepackage[margin=1in]{geometry}
\jmlrvolume{-- 221}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026 submission}
\editors{Accepted for publication at MIDL 2026}


\newcommand{\myplot}[1]{\includegraphics[width=0.99\linewidth, clip]{#1}}
\newcommand{\mytopplot}[1]{\includegraphics[width=0.99\linewidth, trim=3 3.3 2.8 0 , clip]{#1}}

\titlespacing*{\section}{0pt}{\baselineskip}{0pt}
\titlespacing*{\subsection}{0pt}{\baselineskip}{0pt}
%\titlespacing*{\paragraph}{0pt}{0pt}{0pt}

% {0pt}{5.5ex plus 1ex minus .2ex}{4.3ex plus .2ex}


\title[Active Learning for Fair Brain Segmentation]{Exploring Entropy-based Active Learning for Fair Brain Segmentation}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicated cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Ghazal Danaee\nametag{$^{1}$}} \orcid{0009-0000-0961-834X}
\Email{ghazal.danaee.1@ens.etsmtl.ca}\\
\Name{M\'elanie Gaillochet\nametag{$^{1,2,3}$}} \orcid{0000-0001-8143-2275} \Email{melanie.gaillochet.1@ens.etsmtl.ca}\\
\Name{Christian Desrosiers\nametag{$^{1}$}} \Email{christian.desrosiers@etsmtl.ca}\\
\Name{Herv\'e Lombaert\nametag{$^{1,2,3}$}} \orcid{0000-0002-3352-7533} \Email{herve.lombaert@polymtl.ca}\\
\Name{Sylvain Bouix\nametag{$^{1}$}} \orcid{0000-0003-1326-6054} \Email{sylvain.bouix@etsmtl.ca}\\
\addr $^{1}$ {\'E}cole de technologie sup\'erieure, Montr\'eal, QC, Canada \\
\addr $^{2}$ Polytechnique Montr\'eal, Montr\'eal, QC, Canada\\
\addr $^{3}$ Mila - Quebec AI Institute, Montr\'eal, QC, Canada
}

\begin{document}

\maketitle

\begin{abstract}
Active learning (AL) has emerged as a crucial strategy for reducing the prohibitive annotation costs associated with medical image segmentation. However, standard uncertainty-based AL methods typically focus on maximizing overall performance metrics, ignoring performance disparities across groups defined by sensitive attributes. While fair active learning has been explored in classification tasks, its intersection with medical image segmentation remains largely unexplored. In this work, we introduce a fairness-aware active learning framework with a \textit{Weighted Entropy} selection strategy that modulates uncertainty based on current group-specific performance estimates on the labeled set. To decouple true epistemic uncertainty from anatomical volume differences, we further use a masked, scaled entropy restricted to the region of interest. The framework was evaluated on synthetic T1-weighted brain MRIs with a controlled left caudate bias in both strong and weak bias settings. A 3D U-Net was trained to segment the left caudate under several AL strategies, starting from both demographically balanced and strongly imbalanced initial labeled sets. Experiments demonstrated that our method markedly reduced performance disparities between groups compared to random sampling and standard uncertainty sampling. By prioritizing poorly segmented subgroups during the AL cycles, our method consistently achieved the highest equity-scaled performance and reduced the disparity metric by 75\% (strong bias) and 86\% (weak bias) relative to standard entropy at the final budget. Overall, this work is among the first studies on fair AL for medical image segmentation, offering an efficient strategy to train more equitable models in resource-constrained environments.
\end{abstract}


\begin{keywords}
Active learning, Brain MRI, Fairness, Segmentation.
\end{keywords}

\section{Introduction}
\label{sec:introduction}
Active learning (AL) has become a key strategy for addressing the annotation bottleneck in medical image segmentation. The success of deep learning models for segmentation relies heavily on high-quality data labeled voxel by voxel by experts, which is time-consuming and labor-intensive to obtain.
AL targets this bottleneck by treating annotation as a limited resource.
The process begins with a small labeled set and a large unlabeled pool. 
At each iteration, a model is trained on the current labeled data, an informativeness score is computed for each unlabeled sample, and a small batch of high-scoring samples is selected for expert annotation and added to the training set~\citep{BUDD2021102062}. 
This loop repeats until the labeling budget is exhausted. 
AL can substantially reduce annotation effort while maintaining segmentation accuracy, especially in the low-label regime~\citep{camilleri2024fair}.

However, medical image segmentation poses specific challenges for active learning. 
Unlike natural image datasets, medical image annotation cannot be easily crowdsourced; annotators must have substantial expertise, and privacy concerns further constrain data sharing~\citep{WANG2024103201}. 
The task is intrinsically high-dimensional, and naive uncertainty-based sampling tends to select many highly similar, redundant images or outliers~\citep{Munjal_2022_CVPR}. 
Representative-based and hybrid AL strategies mitigate this by encouraging diversity, but since they require computing distances or distributions in a learned feature space, they can be computationally expensive~\citep{gaillochet2023stochastic}.
In principle, active learning may also help mitigate bias.
In a simulated fraud-detection task, \citet{Weerts2023ActiveSelectionBias} showed that standard uncertainty-based active learning can mitigate selection bias and improve fairness even without a fairness-specific design. By querying uncertain samples, the model explored underrepresented groups, reducing false positive disparities and yielding fairer predictions as a side effect.

Beyond pure performance, a critical emerging concern is fairness. 
Fairness in segmentation entails ensuring that the quality of the segmentation is comparable across groups defined by sensitive attributes, including race and sex. 
Although most of the fairness evaluation work in medical imaging has focused on classification~\citep{mehrabi2022surveybiasfairnessmachine}, recent segmentation studies show clear demographic disparities. 
In cardiac MR and orthopedic imaging, models trained on racially imbalanced data exhibit significantly different performance across racial groups~\citep{puyolanton2021fairnesscardiacmrimage,Puyol-Antón2022,lee2025investigation,siddiqui2024fair}. 
Similar effects have been reported for prostate and skin-lesion segmentation, where race and skin-tone imbalance in the training data leads to reduced performance on Black patients and darker skin types~\citep{alqarni2024investigation,bencevic2024skincolorbias}.
In brain MRI, \citet{Ioannou2022DemographicBiasBrainMR} demonstrated that FastSurferCNN exhibits region-specific sex and race biases, noting that race-related disparities can exceed sex-related ones. 
Furthermore, our prior work showed that race matching between training and test sets can substantially improve performance for some architectures but not others when segmenting the nucleus accumbens~\citep{Danaee2025InvestigatingDemographicBias}.
These findings build on evidence that anatomical differences across sex and racial groups shape brain volumes and model behavior~\citep{Frazier2008,Dibaji2024,Isamah2010}. 
Fair segmentation, therefore, requires demographically balanced training cohorts and equity-aware metrics such as Equity-Scaled Segmentation Performance (ESSP)~\citep{Tian2024FairSeg}.


To our knowledge, fair active learning for segmentation in a single domain remains unexplored. \citet{Wang2025VLMFairADA} addressed a related but distinct problem: fairness in cross-domain medical image segmentation. Their method leveraged CLIP~\citep{radford_LearningTransferableVisual_2021} to encode target-domain images and sensitive attributes. It also introduced an attribute-aware sampling strategy, coined FairAP, that enforces balanced annotation quotas across subgroups and selects representative samples in the VLM latent space.

\textbf{Our contribution}: In this work, we address group-wise fairness within the AL acquisition process for brain MRI segmentation. We introduce a novel weighted entropy strategy that modulates voxel-wise uncertainty with group-specific performance weights. These weights are derived from the current Dice score on the labeled set to prioritize samples from under-performing groups. To ensure that the acquisition score reflects true epistemic uncertainty rather than anatomical variance, we compute a scaled entropy within a dilated region of interest (ROI) mask. The scaled entropy focuses on boundary uncertainty and prevents larger structures from systematically biasing the selection process across demographic groups.

\section{Related work}
\label{sec:related_works}
\paragraph{Active Learning in Segmentation.~} 
We next summarize recent AL approaches for medical image segmentation. \citet{Atzeni2022Deep} targeted the expected Dice gain per unit of manual contour length (tracing effort) for histology, and suggested selecting a specific region of interest (ROI) in one of the images for manual delineation. \citet{Kim2024Active} used image-level uncertainty with redundancy control for brain tumors, while \citet{Boehringer2023Active} prioritized the most difficult BraTS cases and pseudo-labeled the easier ones. Additionally, \citet{Qu2024Rethinking}, with their DifABAL, selected a compact, representative labeled core in a diffusion-learned latent space. To address the issue of redundancy in uncertainty sampling, \citet{gaillochet2023stochastic} proposed active learning with stochastic batches. This simple but powerful add-on leverages randomness by generating batches of samples randomly and choosing the batch with the highest mean uncertainty, effectively improving diversity without complex computations.
\paragraph{Fair Active Learning.~}
While AL for segmentation is well-studied, previous studies regarding fair active learning focus mainly on classification tasks. \citet{ANAHIDEH2022116981} introduced an expected fairness metric to estimate the impact of each unlabeled sample on group-wise disparity. Their acquisition strategy prioritizes high-entropy instances while favoring those expected to reduce unfairness. Similarly, \citet{YANG2023175} attempted to balance model utility and fairness by querying informative instances for both class and group labels and utilizing a sensitive learner to infer missing attributes. \citet{Wang2022MitigatingBiasLimitedAnnotations} sought to improve classifier fairness by penalizing differences in true positive/false positive rates between groups and requesting more annotated samples from the worst-performing group. More recently, \citet{Fajri2024FALCUR} applied fair k-means to the most uncertain points to obtain clusters that reflect the overall group distribution. They then selected candidates across clusters using a composite score that combines uncertainty and representativeness. \citet{pang2024fairness} aimed to mitigate unfairness among groups without compromising accuracy, using group annotations only on a validation set to preserve privacy. They achieved this by evaluating the impact of each new case on validation accuracy and fairness through an expected risk analysis. Their overall goal was to construct a fairness-aware dataset after active sampling.


\section{Method}
\label{sec:methods}
\subsection{Data}
\label{sec:data}
To have control over the level of unfairness in the data, we generated synthetic T1-weighted brain MR images using the SimBA framework~\citep{Stanley2023SimBA}.
In this framework, images are derived by applying non-linear diffeomorphic transformations sampled from a learned space of deformations to a template image. 
Global deformations can be used to mimic ``regular'' anatomical variation, and localized deformations can be used to model localized ``bias or disease'' effects. The deformations applied to each case are unique and controlled by sampling from a principal component (PC) representation of the deformation fields.
For our experiment, we combined the global transformation with an additional localized deformation in the left caudate. 
We denote as Group 1 the cases generated with both the localized deformation and the global deformation, and as Group 2 the cases generated with the global deformation only.
The strength of the localized deformation, and hence of the bias effect, is varied by scaling the first component of the PC representation by a scalar sampled from $\mathcal{N}(\mu, \sigma)$.
This procedure enabled us to construct two bias-strength conditions across Groups~1 and~2: the ``strong bias'' dataset with $\mu\,{=}\,4$ and $\sigma\,{=}\,2$ and the ``weak bias'' dataset with $\mu\,{=}\,2$ and $\sigma\,{=}\,2$.
Ultimately, the weak bias dataset comprised 312 T1-weighted MRIs, including 156 cases exhibiting the bias effect and 156 cases without it. We generated a strong bias dataset of the same size (312 images), likewise balanced between biased and non-biased cases (156 each). All images had a size of \( 170\,{\times}\,170\,{\times}\,76 \) voxels with an isotropic voxel spacing of \( 1 \)\,mm.



\subsection{Weighted localized entropy}
\label{sec:weighted_entropy}
We introduce two main modifications to the naive use of entropy.
First, we restrict the computation of entropy to a mask around the ROI.
For each unlabeled candidate volume, let $\mathcal{R}$ denote the ROI and $H(v)$ the voxel-wise predictive entropy at voxel $v$. We define a masked, scaled entropy:
\begin{equation}
\label{eq:localized_entropy}
\hat{H} = \frac{1}{|\mathcal{R}|} \sum_{v \in \mathcal{R}} H(v),
\end{equation}
which averages uncertainty over the ROI while normalizing by the region size. This reduces the influence of trivial anatomical volume differences on the overall uncertainty score. 
In our case, $\mathcal{R}$ is a dilated mask around the predicted segmentation of the left caudate.
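The form of $H(v)$ is left generic above. One standard choice, given here only as an illustrative assumption for a binary segmentation with a sigmoid foreground probability $p_v \in (0,1)$, is the binary predictive entropy
% Illustrative assumption: binary entropy of the sigmoid output; p_v is notation introduced for this example.
\begin{equation*}
H(v) = -\,p_v \log p_v \;-\; (1 - p_v)\log(1 - p_v),
\end{equation*}
which reaches its maximum of $\log 2$ at $p_v = 0.5$ and vanishes as $p_v$ approaches $0$ or $1$. Under this choice, $\hat{H}$ in Eq.~\ref{eq:localized_entropy} is bounded by $\log 2$ regardless of the size of $\mathcal{R}$.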

Second, we introduce a group-aware weighting scheme that re-weights the scaled entropy with group-specific weights, thereby prioritizing samples from groups on which the model underperforms. For each group $g$, we first compute a standardized performance score based on the Dice similarity coefficient (DSC) on the labeled set: $z_g = \frac{\overline{\mathrm{DSC}}_{\mathrm{all}} - \overline{\mathrm{DSC}}_{g}}{\sigma_{\mathrm{all}}} \,,$
where $\overline{\mathrm{DSC}}_{g}$ denotes the mean Dice for group $g$, $\overline{\mathrm{DSC}}_{\mathrm{all}}$ is the mean Dice computed over the entire labeled set, and $\sigma_{\mathrm{all}}$ is the corresponding standard deviation across the labeled set. Thus, groups with worse segmentation performance (lower $\overline{\mathrm{DSC}}_{g}$) yield larger $z_g$.
We then transform these standardized scores into normalized group weights via a softmax:
$w_g = \frac{\exp\left(z_g\right)}{\displaystyle \sum_{j} \exp\left(z_j\right)} \,.
$
The final acquisition score used for selection is then given by
\begin{equation}
\label{eq:weighted_entropy}
\mathrm{score}(x_g) = w_g \cdot \hat{H},
\end{equation}
so that uncertainty is explicitly re-weighted toward groups with relatively poorer DSC, ensuring that active learning focuses on groups where the model is currently less reliable.
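To illustrate the weighting with purely hypothetical numbers (not taken from the experiments), consider two equally sized groups with $\overline{\mathrm{DSC}}_{1} = 0.80$, $\overline{\mathrm{DSC}}_{2} = 0.90$, $\overline{\mathrm{DSC}}_{\mathrm{all}} = 0.85$, and $\sigma_{\mathrm{all}} = 0.10$. Then
% Hypothetical values chosen for illustration only.
\begin{align*}
z_1 &= \frac{0.85 - 0.80}{0.10} = 0.5, & z_2 &= \frac{0.85 - 0.90}{0.10} = -0.5,\\
w_1 &= \frac{e^{0.5}}{e^{0.5} + e^{-0.5}} \approx 0.73, & w_2 &= \frac{e^{-0.5}}{e^{0.5} + e^{-0.5}} \approx 0.27.
\end{align*}
In this sketch, two candidates with the same scaled entropy $\hat{H}$ receive scores in the ratio $w_1 / w_2 = e^{z_1 - z_2} = e^{1} \approx 2.7$ in favor of Group~1. A Group~2 candidate would therefore need roughly $2.7$ times the scaled entropy of a Group~1 candidate to be ranked above it.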


\section{Experiment and Results}
\label{sec:results}

\subsection{Implementation details}
\label{sec:training_experiments}
\paragraph{Model architecture and training setup. } Table~\ref{tab:experiment_conditions} summarizes the network, the active learning setup, and the test data.


\begin{table*}[htbp]
\centering
\caption{Summary of the network training setup, test data, and the active-learning configuration. Group~1 denotes cases with an additional localized deformation in the left caudate, while Group~2 contains only global deformation.}
\label{tab:experiment_conditions}
\setlength{\tabcolsep}{10pt}
\renewcommand{\arraystretch}{1.15}
\resizebox{\textwidth}{!}{
\begin{tabular}{@{}
  >{\raggedright\arraybackslash}p{0.34\textwidth}
  >{\raggedright\arraybackslash}p{0.30\textwidth}
  >{\raggedright\arraybackslash}p{0.32\textwidth}
@{}}
\toprule
\textbf{Network} & \textbf{Test data} & \textbf{Active learning} \\
\midrule

\begin{minipage}[t]{\linewidth}
\textbf{3D U-Net} (GroupNorm, 3D convolution, and ReLU blocks, with a sigmoid activation, 2-level configuration with feature maps of size 8, 16) \\[2pt]
\textbf{Optimizer:} Adam \\
\textbf{Epochs:} 200 \\
\textbf{Learning rate:} $10^{-4}$ \\
\textbf{Loss:} binary cross-entropy
\end{minipage}
&
\begin{minipage}[t]{\linewidth}
\textbf{Test sets:}\\
\textbf{Group~1:} $N=30$\\
\textbf{Group~2:} $N=30$\\
\textbf{Combined (Group~1 $\cup$ Group~2):} $N=30$ (15 from Group~1, 15 from Group~2)
\end{minipage}
&
\begin{minipage}[t]{\linewidth}
\textbf{Strategies:}
\begin{itemize}[leftmargin=1.2em, nosep]
  \item Random sampling
  \item Mean entropy sampling
  \item Localized entropy sampling
  \item Weighted localized entropy sampling
\end{itemize}
\textbf{AL schedule:} 5 full AL cycles (10--86 labeled) \\
\textbf{Batch size:} $b=4$
\end{minipage}
\\

\bottomrule
\end{tabular}
}
\end{table*}




\paragraph{Baseline Experiments. }
We first establish baseline fairness by training our model under three training set compositions. The composition details are summarized in Table~\ref{tab:group_performance_summary}.
For both the strong and weak bias datasets, the test set comprised 30 cases, including 15 cases from Group 1 and 15 cases from Group 2.
\paragraph{Active Learning. }
In all AL experiments, we start with 10 labeled samples, select 4 new samples to label at each AL iteration, and perform five complete AL cycles. We investigate three scenarios with varying proportions of Group~1 and Group~2 images in the initial training set. The corresponding compositions are reported in Table~\ref{tab:training_data_bias}.
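One acquisition round with the weighted localized entropy of Section~\ref{sec:weighted_entropy} can be sketched as in Algorithm~\ref{alg:weighted_entropy_sketch}. This is a schematic outline rather than a description of the exact implementation: the top-$b$ selection rule and the notation $\mathcal{L}$, $\mathcal{U}$, $f_\theta$, and $g(x)$ are assumptions introduced for illustration.

% Illustrative sketch only. Assumes the algorithm2e environment provided by the jmlr/midl class.
\begin{algorithm2e}
\caption{Sketch of one acquisition round (illustrative; notation assumed)}
\label{alg:weighted_entropy_sketch}
\KwIn{labeled set $\mathcal{L}$, unlabeled pool $\mathcal{U}$, model $f_\theta$ trained on $\mathcal{L}$, batch size $b$}
\KwOut{batch $\mathcal{B} \subset \mathcal{U}$ sent for annotation}
compute $z_g$ and $w_g$ for each group $g$ from the DSC of $f_\theta$ on $\mathcal{L}$\;
\ForEach{$x \in \mathcal{U}$ with group $g(x)$}{
  predict the left caudate with $f_\theta$ and dilate the predicted mask to obtain $\mathcal{R}$\;
  compute $\hat{H}$ over $\mathcal{R}$ (Eq.~\ref{eq:localized_entropy})\;
  $\mathrm{score}(x) \leftarrow w_{g(x)} \cdot \hat{H}$ (Eq.~\ref{eq:weighted_entropy})\;
}
$\mathcal{B} \leftarrow$ the $b$ samples of $\mathcal{U}$ with the highest scores\;
once $\mathcal{B}$ is annotated, set $\mathcal{L} \leftarrow \mathcal{L} \cup \mathcal{B}$ and $\mathcal{U} \leftarrow \mathcal{U} \setminus \mathcal{B}$\;
\end{algorithm2e}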


\begin{table}[htbp]
\centering
\caption{Training data configuration by bias strength. Group~1 denotes cases with an additional localized deformation in the left caudate, while Group~2 contains only global deformation.}
\label{tab:training_data_bias}
\setlength{\tabcolsep}{10pt}
\renewcommand{\arraystretch}{1.15}
\resizebox{\textwidth}{!}{
\begin{tabular}{@{}
  >{\raggedright\arraybackslash}p{0.22\linewidth}
  >{\raggedright\arraybackslash}p{0.74\linewidth}
@{}}
\toprule
\textbf{Bias strength} & \textbf{Training data} \\
\midrule

\textbf{Strong bias} &
\begin{minipage}[t]{\linewidth}
\textbf{Initial labeled set:} $N_0 = 10$\\
\textbf{Initial group proportions (Group~1/Group~2):}
\begin{itemize}[leftmargin=1.2em, nosep]
  \item 50/50
  \item 80/20
  \item 20/80
\end{itemize}
\end{minipage}
\\
\addlinespace[6pt]

\textbf{Weak bias} &
\begin{minipage}[t]{\linewidth}
\textbf{Initial labeled set:} $N_0 = 10$\\
\textbf{Initial group proportions (Group~1/Group~2):}
\begin{itemize}[leftmargin=1.2em, nosep]
  \item 50/50
  \item 80/20
  \item 20/80
\end{itemize}
\end{minipage}
\\

\bottomrule
\end{tabular}
}
\end{table}



We implemented and tested four AL strategies on the same test set used in the baseline experiments. We compared our fairness-weighted localized entropy sampling with random sampling (RS), mean entropy sampling, and localized entropy sampling (Eq.~\ref{eq:localized_entropy}).
We report DSC, ESSP, and $\Delta$ for each experiment.


\subsection{Evaluation metrics}
\label{sec:evaluation_metrics}
We used DSC to evaluate raw performance. Furthermore, to evaluate fairness in the model's results, we utilized the Equity-Scaled Segmentation Performance (ESSP) metric, originally proposed by \citet{Tian2024FairSeg}. Given $\text{DSC}_{overall}$, the average DSC over all cases, and $\text{DSC}_g$, the average DSC of group $g$, we first define $\Delta$ as the sum of absolute performance discrepancies across all groups:
\begin{equation}
\label{eq:delta}
\Delta = \sum_{g \in G} \Bigl| \text{DSC}_{overall} - \text{DSC}_g \Bigr|.
\end{equation}
ESSP is then computed by penalizing the overall performance with $\Delta$:
\begin{equation}
\label{eq:essp}
\text{ESSP} = \frac{\text{DSC}_{overall}}{1 + \Delta}.
\end{equation}
In essence, ESSP acts as a substitute for DSC, with a penalty for unfairness.
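As a worked example, take the rounded values reported in Table~\ref{tab:group_performance_summary} for the model trained on Group~2 only (strong bias): $\text{DSC}_{overall} = 0.84$, $\text{DSC}_{1} = 0.75$, and $\text{DSC}_{2} = 0.93$. Equations~\ref{eq:delta} and~\ref{eq:essp} then give
\begin{equation*}
\Delta = |0.84 - 0.75| + |0.84 - 0.93| = 0.18, \qquad \text{ESSP} = \frac{0.84}{1 + 0.18} \approx 0.71 .
\end{equation*}
Because the table entries are rounded to two decimals, such hand computations may differ from the reported values in the last digit.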


\subsection{Baseline experiments}
\label{sec:baseline_experiments}
In the baseline experiments, we use all available training data to obtain reference values for the Dice score (DSC), ESSP, and $\Delta$. Results are shown in Table~\ref{tab:group_performance_summary}. Surprisingly, the balanced (pooled) training set does not yield the smallest disparity. Instead, a model trained exclusively on Group~1 (the deformed dataset) achieves the lowest $\Delta$ in both bias settings and, in the weak bias setting, the highest ESSP. The baseline experiments characterize how different training cohort compositions (pooled, Group~1-only, Group~2-only) affect overall accuracy. They also quantify equity using ESSP and $\Delta$ in the absence of active selection. This provides a reference point to assess whether subsequent AL strategies offer fairness gains beyond what simple cohort design can achieve.

\begin{table}[htbp]
\centering
\caption{Segmentation performance (DSC, ESSP, and $\Delta$) stratified by training cohort and evaluated on
the Group 1, Group 2, and pooled test sets of the strong bias and weak bias datasets.
Group 1 denotes cases with an additional localized deformation in the left caudate,
while Group 2 contains only global deformation. The size of the training set is
given in parentheses.}
\label{tab:group_performance_summary}
\begin{small}
\setlength{\tabcolsep}{5pt}
\resizebox{\textwidth}{!}{
\begin{tabular}{l|ccccc}
\toprule
\multirow[b]{ 2}{*}{\textbf{Training}} &
\multicolumn{5}{c}{\textbf{Test}} \\
%\multirow[b]{2}{*}{\textbf{ESSP}} &
%\multirow[b]{2}{*}{\textbf{$\Delta$}} \\
\cmidrule(l{3pt}r{3pt}){2-6}
& \textbf{DSC(G1)} &  \textbf{DSC(G2)} & \textbf{DSC(G1$\cup$G2)} & \textbf{ESSP} & \textbf{$\Delta$}\\
\midrule
\textbf{Strong Bias} \\
%\midrule
Pooled (63 G1 + 63 G2) & 0.88 & 0.93 & 0.91 & 0.91 & 0.04 \\
Group 1 (126) & 0.89 & 0.88 & 0.89 & 0.89 & 0.01 \\
Group 2 (126) & 0.75 & 0.93 & 0.84 & 0.71 & 0.18 \\
\midrule 
\textbf{Weak Bias} \\
%\midrule
Pooled (63 G1 + 63 G2) & 0.90 & 0.91 & 0.90 & 0.89 & 0.01 \\
Group 1 (126) & 0.90 & 0.90 & 0.90 & 0.90 & 0.002 \\
Group 2 (126) & 0.86 & 0.91 & 0.88 & 0.84 & 0.04 \\
\bottomrule
\end{tabular}
}
\end{small}
\end{table}





\subsection{Active learning experiments}
Results for the experiments starting from a balanced dataset (5 samples from each group) are shown in the first row of Fig.~\ref{tab:ESSP} (ESSP). 
The 80/20 and 20/80 Group 1/Group 2 initializations are shown in rows 2 and 3, respectively. 
All methods start from the same training set and thus exhibit identical performance at the first cycle (10 labeled).
We also show $\Delta$ and DSC curves for the strong bias experiment (Fig.~\ref{fig:deltadicestrong}). 
Results are very similar for the weak bias experiment and can be found in the appendix.
Overall, weighted localized entropy outperforms all other methods in terms of ESSP in all scenarios, followed closely by localized entropy, then random sampling.
Global (mean) entropy performs worse than all other methods by a relatively large margin.
In terms of raw performance as measured by DSC (Fig.~\ref{fig:deltadicestrong}), random sampling is often the best strategy, especially at the early stages of AL.
However, this naive strategy is less effective at reducing $\Delta$ than localized entropy and the proposed weighted localized entropy (Fig.~\ref{fig:deltadicestrong}).
The substantial reduction of $\Delta$ by weighted localized entropy, combined with a highly competitive DSC, allows it to achieve the top ESSP scores across most experiments.

\begin{figure}[htbp]
\centering
% Optimization settings:
\setlength{\tabcolsep}{2pt} % Small padding
\renewcommand{\arraystretch}{1.2} % Vertical spacing
\scriptsize % Smaller text size
% Two columns only (no first column)
\begin{tabular}{ >{\centering\arraybackslash}m{0.48\textwidth} >{\centering\arraybackslash}m{0.48\textwidth} }
\mytopplot{images_modified/strong_ESSP_Bal.png} &
\mytopplot{images_modified/weak_ESSP_Bal.png} \\
\myplot{images_modified/strong_ESSP_biased80.png} &
\myplot{images_modified/weak_ESSP_biased80.png} \\
\myplot{images_modified/strong_ESSP_unbias80.png} &
\myplot{images_modified/weak_ESSP_unbiased80.png} \\
\end{tabular}
\caption{\textbf{ESSP} under different initial training set compositions. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: strong bias dataset, Right column: weak bias dataset.}
\label{tab:ESSP}
\end{figure}


\paragraph{Selection dynamics and group composition.}
As suggested by the baseline experiments (Table~\ref{tab:group_performance_summary}), the best ESSP is likely achieved by over-representing Group~1 in the training dataset.
This is illustrated in Fig.~\ref{tab:ratio}, where, under all scenarios, the weighted localized entropy consistently favors adding Group~1 samples to the training dataset.
One can also observe a link between localized entropy and fairness, as this strategy also tends to select samples from Group~1 even though it does not explicitly account for fairness.
Random sampling behaves as expected, balancing the data toward 50/50 over time, while global entropy behaves counterintuitively by adding more samples from Group~2.


\begin{figure}[htbp]
\centering
% 1. Remove padding between columns
\setlength{\tabcolsep}{0pt} 
\renewcommand{\arraystretch}{0.5} % Reduce vertical gap between images if needed
% 2. Define columns: @{} removes side margins. 
% We use 'c' (center) to let the image define the width, or 'p' to force width.
% Using 0.5\textwidth allows the images to touch.
\begin{tabular}{ @{} p{0.5\textwidth} @{} p{0.5\textwidth} @{}}

    \mytopplot{images_modified/strong_delta_Bal.png} &
    \mytopplot{images_modified/strong_dice_Bal.png}
    \tabularnewline

    \myplot{images_modified/strong_delta_biased80.png} &
    \myplot{images_modified/strong_dice_biased80.png}
    \tabularnewline

    \myplot{images_modified/strong_delta_unbias80.png} &
    \myplot{images_modified/strong_dice_unbias80.png}
    \tabularnewline
\end{tabular}
\caption{$\mathbf{\Delta}$ and \textbf{DSC} metrics under different initial training set compositions for the strong bias experiment only. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: $\Delta$, Right column: DSC.}
\label{fig:deltadicestrong}
\end{figure}


\begin{figure}[htbp]
\centering
% Optimization settings:
\setlength{\tabcolsep}{2pt} % Small padding
\renewcommand{\arraystretch}{1.2} % Vertical spacing
\scriptsize % Smaller text size
% Two columns only (no first column)
\begin{tabular}{ >{\centering\arraybackslash}m{0.48\textwidth} >{\centering\arraybackslash}m{0.48\textwidth} }
% Row 1
\includegraphics[width=0.99\linewidth]{images_modified/strong_ratio_Bal.png} &
\includegraphics[width=0.99\linewidth]{images_modified/weak_ratio_Bal.png} \\
% Row 2
\includegraphics[width=0.99\linewidth]{images_modified/strong_ratio_biased80.png} &
\includegraphics[width=0.99\linewidth]{images_modified/weak_ratio_bias80.png} \\
% Row 3
\includegraphics[width=0.99\linewidth]{images_modified/strong_ratio_unbiased80.png} &
\includegraphics[width=0.99\linewidth]{images_modified/weak_ratio_unbias80.png} \\
\end{tabular}
\caption{\textbf{Group 1 ratio} in the training set after sampling at each cycle under different initial training set compositions. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: strong bias dataset, Right column: weak bias dataset.}
\label{tab:ratio}
\end{figure}



\section{Discussion}
In this work, we investigated the intersection of active learning and fairness in medical image segmentation. 
We designed an acquisition algorithm to improve group-wise fairness rather than solely optimizing accuracy. 
The proposed Weighted Localized Entropy consistently achieved the strongest equity-aware performance across initialization regimes and bias strengths.
Unsurprisingly, localized entropy alone also reduced bias, and applying the group-fairness weights yielded further improvements in both fairness and accuracy. 
We note that random sampling usually achieved the highest accuracy, but its equity-scaled performance was harmed by a substantial group-wise performance disparity. 
Global entropy was frequently the worst-performing method in terms of both accuracy and fairness, as it tended to add more Group~2 cases over time despite Group~1 being the morphology-challenged subgroup. 
A likely explanation is that whole-volume uncertainty is dominated by structural extent and incidental variability rather than by meaningful model confusion at the target boundary. 
By computing entropy inside an ROI mask and normalizing by the ROI size (Eq.~\ref{eq:localized_entropy}), localized entropy better isolates the uncertainty that is relevant to the target structure. 

The proposed method demonstrated robustness in both strong and weak bias scenarios. 
In the strong bias setting, we observed a reduction in $\Delta$ of approximately 75\% (0.0176 vs. 0.0692) relative to standard entropy at the final cycle. 
Notably, in the weak bias setting, in which morphological differences are harder to detect, our method was even more effective relative to the baselines, reducing disparity by approximately 86\%. 
This indicates that the weighted entropy signal is sensitive enough to detect and correct minor performance drifts.
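
As a sanity check on these percentages, the relative reduction in $\Delta$ can be recomputed from final-cycle values. The weak-bias values used here ($\Delta=0.0031$ for the proposed method and $\Delta=0.0218$ for standard entropy) are assumed to correspond to the balanced-initialization results:
\[
1 - \frac{0.0176}{0.0692} \approx 0.746 \quad \text{(strong bias)}, \qquad
1 - \frac{0.0031}{0.0218} \approx 0.858 \quad \text{(weak bias)}.
\]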

As observed in Table~\ref{tab:group_performance_summary}, the relationship between training composition and equitable performance is non-trivial. 
In the weak bias dataset, the Group~1-only configuration was the fairest baseline (DSC=0.90, ESSP=0.90, $\Delta$=0.002), outperforming training on all groups (DSC=0.90, ESSP=0.89, $\Delta$=0.01). 
Overall, these outcomes suggest that the subgroup associated with more challenging morphology (Group~1) acts as a fairness anchor, and that overrepresenting it can reduce group disparity without severely compromising overall utility. 
This observation motivates the core design of our AL strategy, which adaptively increases the selection pressure toward the currently under-performing group.

%Most fair AL methods have been developed for classification and often rely on group labels or fairness or utility estimates computed on a separate validation set\citep{pang2024fairness}. 
We address AL for segmentation with a lightweight mechanism that uses only labeled-set performance to estimate group weights and requires neither an additional fairness classifier nor expensive representativeness modeling. 
Our approach is close to performance-driven group reweighting strategies proposed for classification \citep{Wang2022MitigatingBiasLimitedAnnotations}, but is adapted to voxel-wise uncertainty.


Although one might suspect overfitting to Group~1, this is not supported in the classical sense: when trained only on Group~1, the model still generalizes well to Group~2, with Group~2 performance remaining close to Group~1 performance. 
Additionally, Group~2 performance differs only marginally across all baseline experiments.

We argue that the higher ESSP achieved by training on Group~1 only is not driven by overfitting, but by robust generalization from training on the more morphologically challenging Group~1 data. Group~1 includes both global inter-subject variability and additional localized deformations. A model trained on Group~1 learns features that remain valid when evaluated on Group~2, where the task is effectively easier because the localized deformations are absent. In contrast, training on pooled Group~1 and Group~2 data can encourage shortcut learning, where the model preferentially fits the easiest, most frequent patterns (Group~2) and underfits the complex Group~1 cases, consistent with the gaps observed in Table~\ref{tab:group_performance_summary}.

Importantly, Group~1-only training is not presented as a universally optimal deployment strategy. ESSP is not a pure accuracy metric: it explicitly penalizes between-group disparity through $\Delta$, defined as the sum of absolute deviations of the group-wise Dice scores from the overall Dice. Since Group~1 is the morphology-challenged subgroup by construction, training on it alone mainly benefits the group that would otherwise lag behind: Group~2 performance remains high while the inter-group gap shrinks, which is exactly what ESSP rewards. Moreover, the Table~1 results are computed on a held-out, balanced test set (15 subjects per group), so the effect is not memorization-based overfitting.
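
To illustrate how $\Delta$ enters the score, the following rough check assumes the commonly used form $\mathrm{ESSP} = \mathrm{DSC}/(1+\Delta)$, which should be read against the definition given earlier in the paper. Using the rounded weak-bias baseline values quoted above:
\[
\frac{0.90}{1+0.002} \approx 0.898 \quad \text{(Group~1 only)}, \qquad
\frac{0.90}{1+0.01} \approx 0.891 \quad \text{(all groups)}.
\]
With identical overall DSC, the difference in ESSP comes entirely from the disparity term, in line with the reported values of 0.90 and 0.89.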

One limitation of this study is the use of a synthetic dataset, which we selected to explicitly control the presence and magnitude of morphological bias. This controlled setting enables a clear link between sampling strategy and performance bias.
Real-world medical data contains complex biases (e.g., scanner artifacts correlated with hospital demographics) that are harder to disentangle than anatomical deformations. 
\textit{Evaluating} our framework in real-world data scenarios requires (i) identification of a dataset with known biases to confirm that a measurable disparity exists, and (ii) reliable reference segmentations for training, ideally generated without performance biases. 
In practice, finding such datasets is extremely challenging. 
Automatic segmentations may carry performance biases, which makes fairness evaluation difficult, and manual annotations by experts are scarce. Moreover, real cohorts may exhibit subtle or region-specific morphometric differences (or none at all), and the existence of such differences cannot be assumed a priori. 
These challenges led us to perform these experiments solely with synthetic data. Our results support the use of a weighted sampling strategy to avoid performance bias associated with group-level attributes.
To \textit{translate} our framework to a real-world scenario, one would need to identify one or more sensitive attributes based on a priori hypotheses (e.g., structure X has been reported to be larger in sub-population Y) and guide the sampling strategy using the weighted localized entropy (Eq.~\ref{eq:weighted_entropy}).


Another limitation of our work is that our weighting strategy relies on the availability of group labels for the labeled set. While this is a reasonable assumption in a controlled AL setup, extending the approach to scenarios where sensitive attributes are missing is a necessary future step.

\section{Conclusion}
We presented a fairness-aware active learning framework for brain MRI segmentation. By combining a performance-based weighting scheme with localized entropy, the proposed algorithm actively constructs a training set that prioritizes equity. 
This is especially relevant when deploying segmentation models in settings where labeling budgets are limited and fairness requirements are critical. 
Our study provides a foundation for advancing fair active-learning approaches in medical image segmentation.

\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
%\midlacknowledgments{}

\bibliography{midl26_221}

\appendix
\clearpage
\section{$\Delta$ and DSC results for the weak bias experiments}

\begin{figure}[htbp]
\centering
% 1. Remove padding between columns
\setlength{\tabcolsep}{0pt} 
\renewcommand{\arraystretch}{0.5} % Reduce vertical gap between images if needed
% 2. Define columns: @{} removes side margins. 
% We use 'c' (center) to let the image define the width, or 'p' to force width.
% Using 0.5\textwidth allows the images to touch.
\begin{tabular}{ @{} p{0.5\textwidth} @{} p{0.5\textwidth} @{}}
    \mytopplot{images_modified/weak_delta_Bal.png} &
    \mytopplot{images_modified/weak_dice_Bal.png}
    \tabularnewline
    \myplot{images_modified/weak_delta_biased80.png} &
    \myplot{images_modified/weak_dice_biased80.png}
    \tabularnewline
    \myplot{images_modified/weak_delta_unbiased80.png} &
    \myplot{images_modified/weak_dice_unbiased80.png}
    \tabularnewline
\end{tabular}


\caption{Performance metrics under different initial training set compositions for the weak bias experiments. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: $\Delta$, Right column: DSC.}
\label{fig:deltadiceweak}
\end{figure}


\end{document}


% \subsection{Unbalanced initial training set (80\% Group 1)}
% 
% \textcolor{red}{CD: The way results are presented is a bit repetitive... Is there are a way to merge the section and instead emphasize the differences (instead of repeating the number in the tables/plots) ?}
% 
% Results for the experiment starting from an unbalanced dataset (8 Group 1 and 2 Group 2) are shown in the second row of Tables \ref{tab:ESSP} (ESSP), \ref{tab:delta} ($\Delta$), and \ref{tab:dice} (DSC). 

% \paragraph {Strong Bias Dataset Results}
% In evaluating active learning strategies under an 80/20 initial skew. Standard Entropy proved to be the least equitable approach (ESSP\,=\,0.843, $\Delta$\,=\,0.061, DSC=0.893). Random Sampling achieved the highest overall accuracy, but retain significant bias (ESSP=0.866, $\Delta$\,=\,0.046, DSC\,=\,0.906). In contrast, the proposed Weighted Entropy (scaled) strategy outperformed all strategies including Scaled entropy (ESSP\,=\,0.880 ,$\Delta$\,=\,0.02, DSC\,=\,0.904) in terms of fairness while maintaining competitive accuracy (ESSP\,=\,0.889 ,$\Delta$\,=\,0.009, DSC\,=\,0.897). 


%This superior performance is driven by a distinct selection dynamic: rather than attempting to equalize sampling quotas or following the standard Entropy tendency to sample the majority group, the proposed method exclusively targeted the underperforming subgroup (Group 1) across all cycles, demonstrating that group-aware reweighting is essential to concentrate annotation effort where model performance lags.

% We evaluated an imbalanced initial training set (8 Group 1 vs.\ 2 Group 2 labeled cases) to mimic realistic annotation bias. We compared Entropy, random sampling, entropy (scaled), and the proposed weighted entropy (scaled), averaged over five runs. Although overall accuracy ($I$) increased across all strategies, strategies' fairness results diverged. Entropy amplified inter-group disparity early and remained the least equitable. At the final cycle (86 labeled), it showed the largest gap ($\Delta \approx 0.061$) and the lowest $ESSP \approx 0.843$. Random sampling achieved high Dice ($I \approx 0.906$) but still retained disparity ($\Delta \approx 0.046, ESSP \approx 0.866$). The proposed weighted entropy (scaled) delivered the best results by achieving the smallest final disparity ($\Delta \approx 0.009$) and the highest $ESSP \approx 0.889$, while maintaining competitive Dice ($I \approx 0.897$). Overall, under an 80/20 Group 1-skewed initial training setup, group-aware reweighting is necessary to prevent disparity increase and to maximize fairness.

% \textbf{Selection dynamics and group composition.}
% In this section, we analyzed the queried batches in one specific experiment for each strategy. Despite starting from a Group 1-dominant training set, the strategies exhibited different behaviors. Entropy shifted toward selecting Group 2 cases: across all AL cycles, only 25 of 80 queried volumes were from Group 1. RS produced a near-balanced selection stream (39 Group 1 vs.\ 41 Group 2). In contrast, Entropy (scaled) and Weighted Entropy (scaled) sampled more frequently from Group 1. Entropy (scaled) selected 71 of 80 queried volumes from Group 1. Interestingly, Weighted Entropy (scaled) selected Group 1 exclusively across all cycles (80/80). This behavior is consistent with the observed group weighting derived from labeled-set performance:Group 1 remained the underperforming subgroup during training. Thus, the proposed method does not attempt to equalize sampling quotas. It concentrates annotation on the group for which segmentation quality lags behind the global average.


% \paragraph{Weak Bias Dataset Results}
% We also evaluated the AL strategies with an unbalanced initial training set on the weak bias dataset. similar to the strong bias experiment, random sampling had the highest accuracy, but failed to mitigate inequity (ESSP\,0.884=\,, $\Delta$\,=\,0.0154, DSC\,=\,0.898). Standard Entropy was the least robust strategy (ESSP\,=\,0.873, $\Delta$\,=\,0.0195, DSC\,=\,). 
% 86labeled,0.8895236622,0.0195116702,0.8725032165,0.0000000000,0.0000000000,0.0000000000

% Our proposed Weighted Entropy (scaled) strategy demonstrated superior equity-aware performance (ESSP\,=\,0.892, $\Delta$\,=\,0.00346, DSC\,=\,0.895) over other methods such as Scaled Entropy (ESSP\,=\,0.889, $\Delta$\,=\,0.006, DSC\,=\,0.895) 
% When compared to the results of strong bias dataset where standard Entropy exacerbated disparity to $\Delta \approx 0.061$ and was not stable, the proposed method exhibited stability across varying bias intensities as shown in \tableref{tab:delta}. Even in the weak bias setting, the proposed approach reduced final disparity by approximately $82\%$ relative to standard Entropy.
% \paragraph{Selection dynamics and group composition.}
% In the strong bias setting, Weighted entropy scaled further amplified Group 1 representation, approaching an extreme majority, while Entropy scaled also sustains a high Group 1 share but generally below the weighted approach. In contrast, RS gradually regresses toward 50\% ratio, and Entropy decreases Group 1 even more than RS. The weak bias results mirror this pattern: weighted entropy (scaled) and entropy (scaled) still maintain elevated Group 1 proportions with the proposed method selecting from Group 2 as well in 3 cycles. 

% We investigated an imbalanced initial training set on the weak bias dataset (8 Group 1 vs.\ 2 Group 2 labeled cases). All strategies acquired identical performance at 10 labeled (I=0.836, $\Delta=0.0226$, ESSP=0.819). By 30 labeled, both Entropy (scaled) and weighted Entropy (scaled) exhibited improved equity relative to Entropy. Specifically, Entropy (scaled) reduced disparity to $\Delta=0.00774$ (ESSP=0.873), while the proposed Weighted Entropy (scaled) further improved fairness with $\Delta=0.00641$ (ESSP=0.873). In contrast, Entropy achieved higher disparity ($\Delta=0.0102$, ESSP=0.856), and RS showed higher $\Delta$ at this stage ($\Delta=0.0172$), despite strong accuracy (I=0.887).
% At convergence (86 labeled), the weak bias dataset yielded smaller fairness gaps than the strong-bias counterpart, yet consistent method separation remained. Weighted Entropy (scaled) achieved the best equity-aware performance with I=0.895, $\Delta=0.00346$, and ESSP=0.892. Entropy (scaled) remained competitive in accuracy (I=0.896) but with a larger disparity ($\Delta=0.00688$, $ESSP=0.890$). RS achieved the highest dice among baselines (I=0.898) but retained higher inequity ($\Delta=0.0154$, ESSP=0.884), while Entropy showed the weakest fairness at the final cycle (I=0.890, $\Delta=0.0195$, ESSP=0.873). Relative to Entropy, the proposed method reduced final disparity by $\approx 82\%$ (0.00346 vs.\ 0.0195) and improved ESSP by $\approx 0.019$.
% Comparing these results to the strong bias dataset reveals a consistent ranking of strategies, though the amount of disparities differs in magnitude. In the strong bias experiments under the same 80\% Group 1 initial training set, Entropy resulted in a final disparity of $\Delta \approx 0.061$, whereas Weighted Entropy (scaled) successfully suppressed this to $\approx 0.009$. On the weak bias dataset, the proposed method achieves an even tighter fairness bound ($\Delta \approx 0.003$). Notably, while Entropy amplifies disparity in the strong bias dataset, it exhibits a non-monotonic behavior in the weak bias setting by initially improving before degrading, and this highlights its instability for fairness preservation across different setups. In both scenarios, Weighted Entropy (scaled) proves robust; it reduces the final $\Delta$ by roughly 85\% compared to standard Entropy in the weak bias case (0.003 vs 0.020).
% \subsection{Unbalanced initial training set (80\% Group 2)}
% Results for the experiment starting from an unbalanced dataset (2 Group 1 and 8 Group 2) are shown in the third row of Tables \ref{tab:ESSP} (ESSP), \ref{tab:delta} ($\Delta$), and \ref{tab:dice} (DSC). 
% \paragraph{Strong Bias Dataset Results}
% In the imbalanced scenario on the strong bias dataset (20/80), the model started with significant unfairness ($\Delta$\,=\,0.106, ESSP\,=\,0.774). At the final cycle, the strategies can be ranked by ESSP as follows: Weighted Entropy (scaled), Entropy (scaled), Random Sampling, and standard Entropy. Specifically, the proposed Weighted Entropy (scaled) yielded the best fairness outcome, with the lowest disparity and competitive overall accuracy (ESSP\,=\,0.881, $\Delta$\,=\,0.025, DSC\,=\,0.903). Entropy (scaled) was a close second, maintaining high overall accuracy but with a slightly larger gap (ESSP\,=\,0.880, $\Delta$\,=\,0.029, DSC\,=\,0.905) . Naive exploration via Random Sampling achieved the highest overall Dice but retained a significantly higher disparity (ESSP\,=\,0.866, $\Delta$\,=\,0.048, DSC\,=\,0.908).
% We next evaluated the inverse imbalanced initial training set (8 Group 2 vs.\ 2 Group 1 labeled cases).
% All strategies start from the same initial training set and seed and therefore show identical performance at the first cycle (10 labeled): 
% I=0.857, $\Delta=0.106$, and ESSP=0.774. 
% Entropy exhibited the weakest behavior under this setup, with disparity increasing early rather than decreasing. By 30 labeled volumes, it reached $\Delta \approx 0.132$ and $ESSP \approx 0.757$. This shows that uncertainty sampling does not correct the initial imbalance when Group 1 is scarce.
% In contrast, constraining uncertainty to the dilated left caudate and normalizing by region size substantially stabilized fairness. At 30 labeled, both entropy (scaled) and weighted entropy (scaled) reduced disparity to $\Delta \approx 0.042$, while improving ESSP to $\approx 0.851$.
% As labeling progressed, Weighted Entropy (scaled) provided a consistent advantage over the Entropy (scaled). At 50 labeled volumes, Weighted Entropy (scaled) achieved lower disparity than Entropy (scaled) ($\Delta \approx 0.032$ vs.\ $\Delta \approx 0.034$). This trend persisted to the final cycle (86 labeled), where Weighted Entropy (scaled) yielded the best fairness results with $I \approx 0.903$, $\Delta \approx 0.025$, and $ESSP \approx 0.881$. Entropy (scaled) remained competitive in overall accuracy ($I \approx 0.905$) but with a larger $\Delta$ ($\Delta \approx 0.029$, $ESSP \approx 0.880$). RS achieved the highest overall Dice ($I \approx 0.908$) yet retained higher disparity ($\Delta \approx 0.048$, $ESSP \approx 0.866$), confirming that naive exploration alone is insufficient to reliably restore equity when the sensitive subgroup is initially rare.
% \textbf{Selection dynamics and group composition.}
% Despite the initially scarce presence of Group 1, the strategies exhibited  different selection trajectories. Entropy preserved the Group 2-skewed sampling stream, querying only 27 of 80 volumes from Group 1 across the AL cycles. Consequently, the final labeled set remained Group 2-dominant (29 Group 1 vs.\ 61 Group 2 at 86 labeled). Random sampling produced a near-balanced acquisition pattern (40 Group 1 vs.\ 40 Group 2), yielding a rebalanced labeled pool (42 vs.\ 48).
% In contrast, Entropy (scaled) and Weighted Entropy (scaled) shifted selection toward Group 1. Entropy (scaled) queried 73 of 80 volumes from Group 1, resulting in a strongly Group 1-enriched labeled set at the final cycle (75 vs.\ 15). The proposed Weighted Entropy (scaled) had similar behavior, selecting Group 1 exclusively (80/80).}
% \paragraph{Weak Bias Dataset Results}
% We evaluated the AL strategies in unbalanced initial training set of weak bias data. At the final cycle, the fairness-based ranking mirrored that of the strong-bias dataset, with the proposed Weighted Entropy (scaled) strategy emerging as the fairest, followed by Entropy (scaled), Random Sampling, and standard Entropy.
% %The proposed Weighted Entropy (scaled) achieved the optimal balance of performance and equity ($I=0.896$, $\Delta=0.00373$, $ESSP=0.892$), demonstrating a further refinement over Entropy (scaled) ($I=0.896$, $\Delta=0.00863$, $ESSP=0.888$). Although Random Sampling achieved the highest dice ($I=0.897$), it retained nearly five times the disparity of our method ($\Delta=0.0176$, $ESSP=0.882$). Relative to the weakest baseline, standard Entropy ($I=0.887$, $\Delta=0.0256$, $ESSP=0.865$), the Weighted Entropy (scaled) strategy reduced the final disparity $\Delta$ by $\approx 85\%$.
% \paragraph{Selection dynamics and group composition.}
% In the strong bias, Weighted entropy (scaled) shows the largest rise in Group 1 share from ~20\% to a strong majority in the last cycle, while Entropy-scaled also increases Group 1 but plateaus lower ($\approx80\%$). RS remains intermediate, and Entropy shows the smallest and slowest Group 1 increase. The weak-bias dataset follows the same ordering.
% % We next evaluated the inverse imbalanced initial training set on the weak bias dataset.
% % All strategies start from the same seed set and therefore show identical performance at the first cycle (10 labeled):
% % I=0.844, $\Delta=0.0623$, and ESSP=0.794.
% % Despite the weaker morphological shift in this dataset, this initial training set setup still induces a noticeable early disparity, indicating that initialization bias can be a dominant driver of unfairness even when the intrinsic data bias is mild.
% % As labeling progressed, Entropy exhibited the least performance regarding equity. At 30 labeled volumes, it retained a large gap ($\Delta=0.0444$) and the lowest ESSP=0.828. In contrast, strategies that normalized uncertainty within the dilated left caudate stabilized equity. At 30 labeled, Entropy (scaled) reduced disparity to $\Delta=0.0129$ (ESSP=0.870), while the proposed Weighted Entropy (scaled) further improved fairness to $\Delta=0.00683$ with ESSP=0.873.
% % At later cycles, Weighted Entropy (scaled) provided a consistent additional fairness refinement over the Entropy (scaled). At 50 labeled, our method achieved $\Delta=0.00637$ and ESSP=0.884, compared to Entropy (scaled) with $\Delta=0.00830$ and ESSP=0.884. At convergence (86 labeled), Weighted Entropy (scaled) yielded the best equity-aware outcome with I=0.896, $\Delta=0.00373$, and ESSP=0.892. Entropy (scaled) remained competitive in accuracy (I=0.896) but with a larger disparity ($\Delta=0.00863$, ESSP=0.888). RS achieved a strong overall dice (I=0.897), yet performed worse concerning inequity ($\Delta=0.0176$, ESSP=0.882), while Entropy remained the weakest fairness baseline (I=0.887, $\Delta=0.0256$, ESSP=0.865). Overall, relative to Entropy at the final cycle, the proposed method reduced $\Delta$ by $\approx 85\%$ (0.00373 vs.\ 0.0256) and improved ESSP by $\approx 0.027$.
% % Comparing these results to the strong bias dataset under the same initialization reveals the same ranking of strategies, though the absolute magnitude of disparities differs. In the strong bias case, the initial disparity was much higher ($\Delta$=0.106 vs.\ 0.062), and Entropy led to a degradation in fairness (increasing $\Delta$ to 0.132 by cycle 30). In the weak bias case, Entropy did not degrade fairness but was inefficient at improving it compared to other methods. Critically, the proposed Weighted Entropy (scaled) proved robust in both scenarios. In the strong bias regime, it reduced the final disparity to $\Delta=\approx0.025$ (vs.\ RS $\approx$0.048). In the weak bias regime, it nearly eliminated disparity, reaching $\Delta\approx0.004$ (vs.\ RS 0.018). Overall, relative to Entropy (scaled), the weighted approach reduced the final disparity by approximately 55\% (0.004 vs.\ 0.009), demonstrating that group-aware reweighting provides an impactful correction that Entropy (scaled) alone cannot achieve.



% \begin{figure}[htbp]
% \centering

% % 1. Remove padding between columns
% \setlength{\tabcolsep}{0pt} 
% \renewcommand{\arraystretch}{0.5} % Reduce vertical gap between images if needed

% % 2. Define columns: @{} removes side margins. 
% % We use 'c' (center) to let the image define the width, or 'p' to force width.
% % Using 0.5\textwidth allows the images to touch.
% \begin{tabular}{ @{} p{0.33\textwidth} @{} p{0.33\textwidth} @{} p{0.33\textwidth} @{}}
    
%     % --- Row 1: Dice ---
%     \mytopplot{images_modified/strong_ESSP_Bal.png} &
%     \mytopplot{images_modified/strong_delta_Bal.png} &
%     \mytopplot{images_modified/strong_dice_Bal.png}
%     \tabularnewline

%     \myplot{images_modified/strong_ESSP_biased80.png} &
%     \myplot{images_modified/strong_delta_biased80.png} &
%     \myplot{images_modified/strong_dice_biased80.png}
%     \tabularnewline

%     \myplot{images_modified/strong_ESSP_unbias80.png} &
%     \myplot{images_modified/strong_delta_unbias80.png} &
%     \myplot{images_modified/strong_dice_unbias80.png}
%     \tabularnewline
% \end{tabular}


% \caption{Performance metrics under different initial training set compositions for the strong bias experiments. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: ESSP, middle column: $\Delta$, Right column: DSC.}
% \label{fig:strongbias}
% \end{figure}

% \begin{figure}[htbp]
% \centering

% % 1. Remove padding between columns
% \setlength{\tabcolsep}{0pt} 
% \renewcommand{\arraystretch}{0.5} % Reduce vertical gap between images if needed

% % 2. Define columns: @{} removes side margins. 
% % We use 'c' (center) to let the image define the width, or 'p' to force width.
% % Using 0.5\textwidth allows the images to touch.
% \begin{tabular}{ @{} p{0.33\textwidth} @{} p{0.33\textwidth} @{} p{0.33\textwidth} @{}}
    
%     % --- Row 1: Dice ---
%     \mytopplot{images_modified/weak_ESSP_Bal.png} &
%     \mytopplot{images_modified/weak_delta_Bal.png} &
%     \mytopplot{images_modified/weak_dice_Bal.png}
%     \tabularnewline

%     \myplot{images_modified/weak_ESSP_biased80.png} &
%     \myplot{images_modified/weak_delta_biased80.png} &
%     \myplot{images_modified/weak_dice_biased80.png}
%     \tabularnewline

%     \myplot{images_modified/weak_ESSP_unbiased80.png} &
%     \myplot{images_modified/weak_delta_unbiased80.png} &
%     \myplot{images_modified/weak_dice_unbiased80.png}
%     \tabularnewline
% \end{tabular}


% \caption{Performance metrics under different initial training set compositions for the weak bias experiments. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: ESSP, middle column: $\Delta$, Right column: DSC.}
% \label{fig:weakbias}
% \end{figure}




% \begin{figure}[htbp]
% \centering

% % Optimization settings:
% % Two columns only (no first column)
% \begin{tabular}{ >{\centering\arraybackslash}m{0.48\textwidth} >{\centering\arraybackslash}m{0.48\textwidth} }

% % Row 1
% % \myplot{images_modified/strong_delta_Bal.png} &
% % \includegraphics[width=0.99\linewidth, trim=10 30 30 30, clip]{images_modified/weak_delta_Bal.png} \\
% \myplot{images_modified/strong_delta_Bal.png} &
% \myplot{images_modified/weak_delta_Bal.png} \\

% % Row 2
% \myplot{images_modified/strong_delta_biased80.png} &
% \myplot{images_modified/weak_delta_biased80.png} \\

% % Row 3
% \myplot{images_modified/strong_delta_unbias80.png} &
% \myplot{images_modified/weak_delta_unbiased80.png} \\

% \end{tabular}

% \caption{$\mathbf{\Delta}$ under different initial training set compositions. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: strong bias dataset, Right column: weak bias dataset.}
% \label{tab:delta}
% \end{figure}




% \begin{figure}[htbp]
% \centering

% % 1. Remove padding between columns
% \setlength{\tabcolsep}{0pt} 
% \renewcommand{\arraystretch}{0.5} % Reduce vertical gap between images if needed

% % 2. Define columns: @{} removes side margins. 
% % We use 'c' (center) to let the image define the width, or 'p' to force width.
% % Using 0.5\textwidth allows the images to touch.
% \begin{tabular}{ @{} p{0.5\textwidth} @{} p{0.5\textwidth} @{} }
    
%     % --- Row 1: Dice ---
%     \myplot{images_modified/strong_dice_Bal.png} &
%     \myplot{images_modified/weak_dice_Bal.png} \tabularnewline

%     \myplot{images_modified/strong_dice_biased80.png} &
%     \myplot{images_modified/weak_dice_biased80.png} \tabularnewline

%     \myplot{images_modified/strong_dice_unbias80.png} &
%     \myplot{images_modified/weak_dice_unbiased80.png} \tabularnewline
% \end{tabular}


% \caption{\textbf{Raw DSC} under different initial training set compositions. First row: balanced initialization, second row: 80/20 Group 1/Group 2 ratio, third row: 20/80 ratio. Left column: strong bias dataset, Right column: weak bias dataset.}
% \label{tab:dice}
% \end{figure}



% \subsection{Balanced initial training set}
% \paragraph{Strong Bias Dataset Results.}
% In terms of ESSP, the entropy (scaled) and weighted entropy (scaled) are indistinguishable during the first six cycles, but the proposed group-aware reweighting becomes beneficial as more labeled data are selected.
% The Weighted Entropy (scaled) method proved superior, achieving the best fairness metrics (ESSP=0.884, DSC=0.90, $\Delta$=0.0176). The Entropy (scaled) method was the second fairest, though its slightly higher raw accuracy was offset by a larger disparity (ESSP\,=\,0.879, DSC\,=\,0.905, $\Delta$\,=\,0.0287). Random Sampling achieved the highest raw accuracy in the final cycle and through the all cycles but was less equitable (ESSP=0.866, $\Delta$=0.0475, DSC=0.907). Entropy performed the worst ($\Delta$=0.0692, ESSP=0.836, DSC=0.894). 

% % In the balanced initial training set experiment, all methods start from the same seed set and thus exhibit identical performance at the first cycle (10 labeled): DSC=0.864, $\Delta=0.069$, and ESSP=0.809. 
% % As labeling increases, localized entropy and weighted localized entropy strategies separate from global entropy sampling. 
% % % In the following, the numbers mentioned are the mean over 5 different experiments differing in the initial training set and seeds.
% % By 30 labeled volumes, both localized entropy and weighted localized entropy achieve a substantial reduction in disparity ($\Delta=0.027$) compared to global entropy ($\Delta=0.080$), yielding higher ESSP (0.860 vs. 0.813). 
% % % This indicates that restricting uncertainty to the dilated left caudate and normalizing by the structure's size helps curb early-cycle unfairness.

% % While localized entropy and weighted localized entropy are indistinguishable during the first six cycles, the proposed group-aware reweighting becomes beneficial as more labeled data are selected.
% % From 34 labeled volumes onward, weighted localized entropy consistently achieves lower $\Delta$, and
% % at the final cycle (86 labeled) it achieves the best fairness metrics with DSC=0.900, $\Delta=0.0176$, and ESSP=0.884.
% % % In comparison, Entropy (scaled) attains higher overall accuracy (I=0.905) but a larger disparity ($\Delta=0.0287$), resulting in a lower ESSP (0.879). 
% % Random sampling remains a competitive strategy, especially in the early cycles, and reaches high overall accuracy (DSC=0.907), but it is less equitable ($\Delta=0.0475$, ESSP=0.866).
% % % , while Entropy have the weakest performance in terms of fairness  throughout training ($\Delta=0.0692$, ESSP=0.836 at 86 labeled). 

% % Overall, relative to Entropy at the final cycle, Weighted Entropy (scaled) reduces $\Delta$ by approximately 75\% (0.0176 vs. 0.0692) while improving ESSP by 0.049 (0.884 vs. 0.836). 

% \paragraph{Weak Bias Dataset Results}
% On the weak bias dataset with a balanced initial training set, the reduced morphological variation narrows the performance gaps, but the ranking of strategies observed in the strong bias experiment is preserved. At the final cycle, Weighted Entropy (scaled) was the fairest strategy (ESSP=0.893, $\Delta$=0.0031, DSC=0.896), followed by Entropy (scaled) (ESSP=0.889, $\Delta$=0.0082, DSC=0.897), Random Sampling (ESSP=0.882, $\Delta$=0.0172, DSC=0.897), and standard Entropy (ESSP=0.870, $\Delta$=0.0218, DSC=0.889). In relative terms, the proposed method is even more effective here than on the strong bias dataset: it reduced disparity relative to Entropy by approximately $75\%$ in the strong bias case and by approximately $86\%$ in the weak bias case. We refer the reader to Tables \ref{tab:ESSP} (ESSP), \ref{tab:delta} ($\Delta$), and \ref{tab:dice} (DSC) for the full behavior over the 19 AL cycles for both strong and weak bias scenarios.



% Figure \ref{tab:ratio} shows the percentage of Group 1 samples in the current labeled dataset, which indicates how each AL strategy favors one group over the other. As expected, random sampling keeps the two groups balanced, whereas standard Entropy gradually reduces the share of Group 1 to below 50\%. In contrast, Weighted Entropy (scaled) rapidly shifts the training set toward Group 1, rising from $\sim$50\% to 90\% by the late cycles, while Entropy (scaled) also enriches Group 1 but stabilizes at a lower level (mid-to-high 80\%).



% % We evaluated a balanced active learning initialization (5 Group 1 and 5 Group 2 labeled cases) on the weak bias dataset.

% % All AL strategies start from a fair point in the balanced setup. At the first cycle (10 labeled), all methods share identical performance with DSC=0.852, $\Delta=0.0238$, and ESSP=0.833.

% % By 30 labeled, methods that restrict uncertainty to the dilated left caudate already demonstrate a clearer fairness advantage. Entropy (scaled) reduces disparity to $\Delta=0.0110$ with ESSP=0.872, while the proposed Weighted Entropy (scaled) further lowers disparity to $\Delta=0.0068$ with ESSP=0.875. In contrast, naive entropy remains less stable ($\Delta=0.0180$, ESSP=0.858), and random sampling shows no consistent early fairness benefit ($\Delta=0.0236$).

% % At the final cycle (86 labeled), the weak-bias results confirm that fairness is easier to maintain overall: because the weak-bias dataset induces smaller morphology-driven performance gaps, disparity remains low across strategies, as reflected in consistently smaller $\Delta$ values at convergence compared to the strong-bias setting. Nevertheless, our method provides the most equitable result. Weighted Entropy (scaled) achieves DSC=0.896, $\Delta=0.0031$, and ESSP=0.893, outperforming Entropy (scaled) (DSC=0.897, $\Delta=0.0082$, ESSP=0.889), Random Sampling (DSC=0.897, $\Delta=0.0172$, ESSP=0.882), and Entropy (DSC=0.889, $\Delta=0.0218$, ESSP=0.870). Relative to Entropy, the proposed strategy reduces the final disparity by $\approx 86\%$ (0.0031 vs.\ 0.0218) while improving ESSP by $\approx 0.023$. Relative to Entropy (scaled), our method still yields a fairness refinement, cutting $\Delta$ by $\approx 63\%$ (0.0031 vs.\ 0.0082).

% % % \textbf{Comparison with strong bias.}
% % Compared with the balanced-seed experiment on the strong bias dataset, the same ranking pattern is preserved, although the absolute fairness gaps are larger in the strong bias case. On the strong bias dataset, Weighted Entropy (scaled) reduced the final disparity by approximately 75\% relative to Entropy ($\Delta \approx 0.018$ vs.\ 0.069). On the weak bias dataset, the proposed method achieves an even larger relative reduction, lowering $\Delta$ by roughly 86\% compared to Entropy ($0.003$ vs.\ $0.022$). Although the weak bias dataset presents a less severe initial fairness challenge, Entropy still fails to close the gap, whereas Weighted Entropy (scaled) neutralizes the bias almost entirely.


% \begin{table}[htbp]
% \centering
% \caption{Segmentation performance (DSC) stratified by training cohort and evaluated on the Group 1, Group 2, and pooled test sets of the weak bias dataset. ESSP and $\Delta$ are reported for each training condition.}
% \label{tab:group_performance_summary}
% \begin{tabular}{lccccc}
% \hline
% \textbf{Training} &
% \textbf{DSC group 1} &
% \textbf{DSC group 2} &
% \textbf{DSC all} &
% \textbf{ESSP} &
% \textbf{$\Delta$} \\
% \hline
% All        & 0.90 & 0.91 & 0.90 & 0.89 & 0.01 \\
% Group 1     & 0.90 & 0.90 & 0.90 & 0.90 & 0.002 \\
% Group 2  & 0.86 & 0.91 & 0.88 & 0.84 & 0.04 \\
% \hline
% \end{tabular}
% \end{table}