\section{Methods}

Our framework, visualized in Figure~\ref{fig:method}, encompasses image denoising, downstream task evaluation, fairness assessment, and bias mitigation for medical image reconstruction. The framework uses classification and segmentation models to estimate the effect of reconstruction on downstream task performance and fairness. Additionally, mitigation strategies are applied exclusively at the reconstruction stage to determine their ability to reduce downstream biases without retraining diagnostic models. %The code will be made publicly available upon publication.


\subsection{Datasets}
We apply our framework to public datasets from two distinct imaging domains:

\paragraph{MRI:} UCSF-PDGM includes 501 pre-operative MRI exams from patients with diffuse glioma, along with tumor masks and labels for subtype and grade~\cite{Calabrese_2022}. We use the T2-weighted FLAIR volumes for all analyses.

\paragraph{X-Ray:} CheXpert comprises 224,316 radiographs from 65,240 patients annotated for 14 thoracic findings~\cite{CheXpert}, of which we use 12 (excluding ``Support Devices'' and ``No Finding'' to focus on disease pathologies).

We use a 70/10/20 train/validation/test split stratified by patient for both datasets. For CheXpert, the training set is further divided 70/30 into non-overlapping sets for reconstruction and classification model training, respectively. For UCSF-PDGM, the same training data is used for both tasks given the smaller sample size.
Group-wise fairness is assessed for age (dichotomized at the dataset median), sex, and self-reported race (unavailable for UCSF-PDGM). Detailed attribute distributions are reported in Tables~\ref{tab:chex_dataset} and~\ref{tab:ucsf_dataset} in the Appendix.
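For reference, the CheXpert partition described above implies the following overall proportions, assuming the 70/30 division is applied to the 70\% training portion at the same level as the main split:
\[
0.70 \times 0.70 = 0.49 \ \text{(reconstruction training)}, \qquad
0.70 \times 0.30 = 0.21 \ \text{(classification training)},
\]
with the remaining 10\% and 20\% held out for validation and testing.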

\subsection{Noising Process}
We simulate realistic acquisition degradations as follows:
\paragraph{MRI:} $k$-space data is masked with radial undersampling patterns~\cite{FengRadial} at acceleration factors 4, 8, and 16, where higher acceleration means greater undersampling (see Appendix \ref{app:mri}). 
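One common way to write retrospective undersampling, given here as a sketch rather than the exact implementation, is
\[
\tilde{x} = \mathcal{F}^{-1}\bigl(M \odot \mathcal{F}(x)\bigr), \qquad
R = \frac{N_k}{\sum_{u} M_u},
\]
where $x$ is the fully sampled image, $\mathcal{F}$ is the 2D Fourier transform, $M \in \{0,1\}^{N_k}$ is the binary radial mask over $N_k$ $k$-space locations, and $R$ is the acceleration factor. These symbols are introduced here for illustration and are not taken from the cited method. Under this assumed definition, $R = 4$, $8$, and $16$ retain 25\%, 12.5\%, and 6.25\% of the $k$-space samples, respectively.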
\paragraph{X-Ray:} Standard-dose images are Radon-projected to sinogram space, bow-tie filtered, and corrupted with Poisson noise parameterized by photon count ($100{,}000$, $10{,}000$, $3{,}000$), with lower photon count yielding more noise~\cite{Gibson2023APM}.
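A standard photon-counting model for this kind of corruption, stated here as an assumption since the exact parameterization follows the cited work, is
\[
\tilde{N} \sim \mathrm{Poisson}\bigl(N_0\, e^{-s}\bigr), \qquad
\tilde{s} = -\log\bigl(\tilde{N}/N_0\bigr),
\]
where $s$ is a clean sinogram value, $N_0$ is the incident photon count, and $\tilde{s}$ is the noisy sinogram value (the symbols are ours, and the bow-tie filter, which would make $N_0$ vary across the detector, is ignored). Because a Poisson variable with mean $\mu$ has standard deviation $\sqrt{\mu}$, the relative noise of an unattenuated ray ($s = 0$) is $1/\sqrt{N_0}$: roughly 0.3\%, 1.0\%, and 1.8\% for $N_0 = 100{,}000$, $10{,}000$, and $3{,}000$.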

These ranges approximate realistic acquisition conditions, with examples in the Appendix (Figures \ref{fig:noise-chex}, \ref{fig:noise-ucsf}, \ref{fig:chex-example}, and \ref{fig:ucsf-example}).

\subsection{Models}
We employ three reconstruction models alongside task-specific diagnostic models. Additional information on the compute infrastructure and model hyperparameters can be found in the Appendix.

\paragraph{Reconstruction:}
To cover deterministic, adversarial, and diffusion regimes, we train from scratch a standard U-Net~\cite{Unet} with MSE loss, a Pix2Pix GAN~\cite{pix2pix2017}, and a stochastic differential equation (SDE)-based diffusion model~\cite{sde} for each dataset. We note that the GAN and diffusion models also use a U-Net as the model architecture but follow different training paradigms.

\paragraph{Diagnostic:}
For classification on UCSF-PDGM, ImageNet-initialized ResNet50 models~\cite{resnet} are trained separately to predict WHO grade and tumor type. The models are trained at the slice level; at test time, predictions are made for each slice individually and then aggregated into a volume-level prediction using the median. For CheXpert classification, a single ImageNet-initialized DenseNet model~\cite{densenet} is trained to jointly predict the 12 findings following~\citet{Cohen2021TorchXRayVisionAL}. For segmentation on UCSF-PDGM, we use an ImageNet-initialized U-Net. Segmentation is not evaluated on CheXpert due to the absence of masks. All downstream models are trained on the original, non-degraded images.

\subsection{Performance and Fairness Evaluation}
Reconstruction quality is measured by PSNR. Downstream performance is measured by AUROC for classification and Dice for segmentation. For classification fairness, we report the worst-case Equalized Odds (EODD)~\cite{DBLP:journals/corr/HardtPS16} difference between groups:
\begin{align*}
\max_{i,j}\,\bigl|&P(\hat{Y} = 1 \mid Y = y, A = a_i) \\
&- P(\hat{Y} = 1 \mid Y = y, A = a_j)\bigr|, \quad \forall y \in \{0,1\}, \\
&\forall\, \text{attributes } A \in \mathcal{A} \text{ with subgroups } a_i, a_j.
\end{align*}
To compute this metric, model predictions are binarized using a balanced threshold selected to achieve approximately equal sensitivity and specificity on the validation split. Equality of Opportunity (EOP) results are also reported in the Appendix (Figures \ref{fig:bias_chexpert_eop} and \ref{fig:bias_ucsf_class_eop}).
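
One plausible formalization of the balanced threshold, with the symbols $t$, $\mathrm{TPR}_{\mathrm{val}}$, and $\mathrm{TNR}_{\mathrm{val}}$ introduced here purely for illustration, is
\[
t^{*} = \arg\min_{t}\,\bigl|\mathrm{TPR}_{\mathrm{val}}(t) - \mathrm{TNR}_{\mathrm{val}}(t)\bigr|,
\]
with $\hat{Y} = 1$ if the predicted probability is at least $t^{*}$ and $\hat{Y} = 0$ otherwise. As a hypothetical example of the resulting metric, if two subgroups have true positive rates of 0.80 and 0.70 and false positive rates of 0.20 and 0.25, the gaps are $0.10$ for $y = 1$ and $0.05$ for $y = 0$, so the worst-case EODD difference (taking the maximum over $y$ as well) is $0.10$.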

For segmentation fairness, we adapt the Skewed-Error Ratio (SER)~\cite{SiddiquiFairSeg} to Dice:
\[
\mathrm{SER}_A=\frac{\max_{i}(1-\mathrm{Dice}_{a_i})}{\min_{j}(1-\mathrm{Dice}_{a_j})}, \quad a_i, a_j \in A, \quad A \in \mathcal{A}
\]
Results using an unnormalized Dice difference are also provided in the Appendix (Figure \ref{fig:bias_ucsf_class_eop}).
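As an illustrative calculation with hypothetical values, suppose two subgroups of an attribute reach Dice scores of 0.85 and 0.80. Then
\[
\mathrm{SER}_A = \frac{1 - 0.80}{1 - 0.85} = \frac{0.20}{0.15} \approx 1.33,
\]
whereas the unnormalized Dice difference is $0.85 - 0.80 = 0.05$. A value of $\mathrm{SER}_A = 1$ corresponds to equal segmentation error across subgroups.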

%\subsubsection{Statistical Analysis.}
Statistical comparisons of subgroup fairness differences were performed using bootstrapped estimates with 1,000 iterations. Statistical significance was defined as a two-sided bootstrap-derived $p < 0.05$.
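One common way to obtain such a two-sided bootstrap $p$-value, given here as an assumption because the exact procedure is not specified above, uses $B = 1{,}000$ bootstrap replicates $\Delta^{(b)}$ of a fairness difference:
\[
p = 2 \min\!\left\{ \frac{\#\{b : \Delta^{(b)} \le 0\}}{B},\ \frac{\#\{b : \Delta^{(b)} \ge 0\}}{B} \right\}.
\]
Under this sketch, a difference is called significant when fewer than 2.5\% of the replicates fall on the opposite side of zero.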

\subsection{Bias Mitigation}
We adapt two bias mitigation strategies that were originally proposed for classification models. Each approach fine-tunes only the reconstruction models after the original training described above. The differentiable equalized-odds approach additionally applies the reconstruction and classification models in tandem, but the classification network is frozen so that bias mitigation is assessed exclusively at the reconstruction stage.

\paragraph{Sample Reweighting:}
A weighted sampler draws each example with probability proportional to its inverse joint subgroup frequency during fine-tuning, ensuring that each subgroup (and combination thereof across attributes) is represented with the same frequency. The reconstruction model is fine-tuned using the corresponding original reconstruction loss.
% A weighted sampler $p_{x_i}$ draws each example with inverse joint subgroup frequency \(n_{(a_{x_i}, b_{x_i},\dots)}\) during fine-tuning, ensuring that each subgroup (and combination thereof across attributes) is represented with the same frequency:
% $
% p_{x_i}=\frac{1/n_{(a_{x_i},b_{x_i},\dots)}}{\sum_{j=1}^{n}1/n_{(a_{x_i},b_{x_i},\dots)}}
% %\quad 
% %\mathcal{L}_{\mathrm{RE}}=\sum_{i=1}^{n}\|x - \hat{x}\|_2^2.
% $
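Written out, and using $g(x_i)$ as shorthand introduced here for the joint subgroup of example $x_i$ (e.g., its combination of age, sex, and race), the sampling probability is
\[
p_{x_i} = \frac{1/n_{g(x_i)}}{\sum_{j=1}^{n} 1/n_{g(x_j)}},
\]
where $n_{g}$ is the number of training examples in joint subgroup $g$. Since $\sum_{j=1}^{n} 1/n_{g(x_j)} = G$, the number of non-empty joint subgroups, each example is drawn with probability $1/(G\, n_{g(x_i)})$ and each subgroup with total probability $1/G$.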

\paragraph{Differentiable Equalized-Odds:} For reconstruction output \(\hat x=f(x)\) and classifier output \(\hat y=g(\hat x)\) we minimize:  
$
\mathcal{L}_{\mathrm{EODD}}
=\ell_{\mathrm{rec}}(\hat{x})
+\lambda_{\mathrm{fair}}\,
\mathrm{EMA}\!\bigl(\ell_{\mathrm{BCE}}(\hat{y})+\mathrm{EODD}^{2}\bigr),
$
%with 
%\[
%\ell_{\mathrm{rec}} = \|x - \hat{x}\|_2^2,\quad
%\ell_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c.
%\] 
where $\ell_{\mathrm{rec}}$ is the original reconstruction loss for the model, $\ell_{\mathrm{BCE}}$ is the binary cross-entropy loss of the frozen classifier, $\mathrm{EMA}$ denotes an exponential moving average, and $\mathrm{EODD}$ is a differentiable Equalized Odds constraint inspired by~\citet{marcinkevičs2022debiasingdeepchestxray}. Specifically, we use the maximum EODD difference across subgroups as defined above and compute it via soft predictions:
$
\tilde y = \sigma\!\bigl((\hat{y}-\tau)/T\bigr),
$
where the threshold \(\tau\) and temperature \(T\) are set at 0.5 and 0.3, respectively. One loss is computed across all sensitive attributes (i.e., the max EODD over age, sex, and race). In the Appendix, we show that minimizing $\mathrm{EODD}^2$ between subgroups corresponds to minimizing their covariance.
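
To make the soft version concrete, one way to write the group-conditional rates, assumed here for illustration and not necessarily the exact implementation, is
\[
\widetilde{P}(\hat{Y}=1 \mid Y=y, A=a) = \frac{1}{|S_{y,a}|}\sum_{k \in S_{y,a}} \tilde{y}_k, \qquad
S_{y,a} = \{k : y_k = y,\ A_k = a\},
\]
i.e., the mean soft prediction over the mini-batch examples with label $y$ in subgroup $a$ (the index set $S_{y,a}$ is our notation). With $\tau = 0.5$ and $T = 0.3$, a classifier output of $\hat{y} = 0.8$ maps to $\sigma(1) \approx 0.73$, $\hat{y} = 0.5$ maps to $0.5$, and $\hat{y} = 0.2$ maps to $\sigma(-1) \approx 0.27$, so the soft predictions approximate the hard threshold while remaining differentiable.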

\paragraph{Code:} Available at \url{https://github.com/lotterlab/reconstruction_evaluation}
%\vspace{10pt}
%\newline
%Code will be made publicly available on Github upon publication and is included as supplemental material for review.
