\setcounter{secnumdepth}{1}
\appendix
\section{  }


%\subsubsection{Noising Process}\label{app:noising-process}

\begin{figure}[ht]
    \centering
    \includegraphics[width=\linewidth]{figs/chex-noise.pdf}
    \caption{X-ray images at photon counts of 100,000, 10,000, and 3,000.}
    \label{fig:noise-chex}
\end{figure}
\begin{figure}[ht]
    \centering
    \includegraphics[width=\linewidth]{figs/ucsf-noise.pdf}
    \caption{MRI images at acceleration factors of 4, 8, and 16.}
    \label{fig:noise-ucsf}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=\linewidth]{figs/chex-recon.pdf}
    \caption{Reconstruction example from photon count 10,000 for the different models. Grad-CAM \cite{Grad-CAM} and logit score correspond to the lung lesion prediction of the pre-trained classifier, indicating similar predictions on the reconstructed images.}
    \label{fig:chex-example}
\end{figure}
\begin{figure}[ht]
    \centering
    \includegraphics[width=\linewidth]{figs/ucsf-recon.pdf}
    \caption{Reconstruction with corresponding segmentation and Dice score of an MRI image with acceleration 8 for the different models.}
    \label{fig:ucsf-example}
\end{figure}


\paragraph{Diagnostic Hyperparameters.}\label{app:hyperparams}
The segmentation network was optimized with Adam \cite{Kingma2014AdamAM} using a learning rate of 0.001 and a batch size of 8 without data augmentation for 20 training epochs. The training loss consisted of Dice and L1, equally weighted at 0.5 each.
The network used a sigmoid activation and a threshold of 0.5 was used at inference to compute the Dice performance.
The model was trained on a per-slice level using all available MRI slices.
At inference, Dice performance was computed on slices 60--130, because this range covers the region where the ground-truth masks appear and therefore better reflects segmentation performance. Dice scores were computed separately for each slice, averaged across slices per patient, and then averaged across patients to obtain the final performance.
For the UCSF-PDGM ResNet classifiers, we trained for 20 epochs with a learning rate of 0.0001 and a batch size of 16 without augmentation.
Each task was treated as binary classification (subtype: glioblastoma vs.\ not glioblastoma; grade: II/III vs.\ IV) using binary cross-entropy loss.
All MRI slices were again used for training, and slices 60--130 were used at inference. Prediction scores were generated separately for each slice, and the patient-level score was taken as the median across slices, which served as input to the patient-level AUROC calculation. The median was used to improve robustness to outliers. All UCSF-PDGM diagnostic models were trained on images pre-processed with min-max normalization to the $[0, 1]$ range and resized to $256 \times 256$.
The CheXpert DenseNet classifier was trained using TorchXRayVision \cite{Cohen2021TorchXRayVisionAL}. 
The default image preprocessing was used, with an input size of $224 \times 224$ pixels and normalization to a range of $-1024$ to $1024$. The model was trained without data augmentation for 50 epochs using the Adam optimizer with a learning rate of 1e-3 and a weight decay of 1e-5.
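
\noindent The two patient-level aggregation rules above can be summarized as follows, where the symbols $\mathcal{P}$ (set of test patients), $\mathcal{S}_p$ (evaluated slices of patient $p$), $d_{p,s}$ (Dice score of slice $s$), and $\hat{y}_{p,s}$ (slice-level prediction score) are notation introduced here for illustration only:
\begin{align*}
    \text{Dice} &= \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \frac{1}{|\mathcal{S}_p|} \sum_{s \in \mathcal{S}_p} d_{p,s}, \\
    \hat{y}_p &= \mathrm{median}_{s \in \mathcal{S}_p} \, \hat{y}_{p,s}.
\end{align*}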

\paragraph{Reconstruction Hyperparameters.}
No data augmentation was applied to any of the reconstruction pipelines.
A U-Net was trained for 20 epochs on both UCSF-PDGM and CheXpert, using Adam with MSE loss, a learning rate of 0.001, and a batch size of 16.
The GAN (Pix2Pix) was trained for 200 epochs on each dataset with Adam, a learning rate of 0.0002, and a batch size of 32 to compensate for the smaller data volume.
For the SDE model, we employed Adam with a learning rate of 0.0001, a cosine learning-rate schedule, and a batch size of 8; training ran for 40 epochs on CheXpert and 300 epochs on UCSF-PDGM.
We note that the number of epochs varied between models because the approaches converge at different rates (e.g., GAN training is inherently less stable than training with a standard MSE loss); in each case, the final weights were selected by monitoring the validation loss, consistent with standard practice.
During mitigation with the EODD-constraint, we employed \(\tau=0.5\) for the threshold, \(T=0.3\) for the temperature, and a momentum value of 0.1 for the EMA.
The remaining hyperparameters and architectural details were adopted unchanged from the original U-Net \cite{Unet}, Pix2Pix \cite{pix2pix2017}, and SDE \cite{sde} publications. Image pre-processing consisted of min-max normalization to the $[0, 1]$ range and resizing to $256 \times 256$ for all reconstruction models.

The models were trained on a single NVIDIA A40 or A100 GPU. The SDE model was the most computationally expensive and required up to 48 hours to train from scratch. For all models, the final weights were chosen based on performance on the validation split during training.

\paragraph{MRI preprocessing and reconstruction details.}\label{app:mri}
UCSF-PDGM provides reconstructed single-channel images (no multi-coil raw k-space data or complex-valued images). As a result, no coil combination, coil compression, or sensitivity map estimation was performed. To simulate undersampled MRI acquisitions, reconstructed images were retrospectively transformed to synthetic k-space using a discrete Fourier transform, implicitly assuming zero phase. Radial undersampling masks \cite{FengRadial} were applied in k-space, and zero-filled reconstructions were obtained via inverse Fourier transform. The models were then trained in an image-to-image fashion to map the zero-filled images to the original reconstructed images. This pipeline was chosen to align with standard practice in MRI reconstruction studies when raw k-space is unavailable. The original UCSF-PDGM dataset was acquired using a 3.0 tesla scanner and a dedicated 8-channel head coil \cite{Calabrese_2022}. Two gadolinium-based contrast agents were used across the cohort: gadobutrol at a dose of 0.1 mL/kg and gadoterate at a dose of 0.2 mL/kg.
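
\noindent As a minimal sketch of this simulation, let $x$ be a reconstructed image, $\mathcal{F}$ the 2D discrete Fourier transform, $M$ a binary radial sampling mask, and $\odot$ the element-wise product (all notation introduced here for illustration). The zero-filled input is then
\[
    x_{\mathrm{zf}} = \mathcal{F}^{-1}\bigl( M \odot \mathcal{F}(x) \bigr),
\]
and a reconstruction model $g_{\phi}$ (a hypothetical symbol standing for any of the networks above) is trained so that $g_{\phi}(x_{\mathrm{zf}}) \approx x$. Because $x_{\mathrm{zf}}$ may be complex-valued, this sketch assumes that a real-valued version of it (e.g., its magnitude) is used as the network input.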

\paragraph{Proof of Proportionality.}\label{app:proof}
When the protected attribute $A$ takes $k > 2$ categories (e.g., multiple races, genders, or age groups), we compare all pairs $a_i, a_j$ of subgroups and take the maximum of the pairwise disparities in true positive and false positive rates:

\begin{align*}
EODD &= \max_{1 \le i < j \le k}
\Bigl[
\;\bigl|P(\hat{Y}=1 \mid Y=1, A=a_i) \\
   \;&-\; P(\hat{Y}=1 \mid Y=1, A=a_j)\bigr| 
\\
\quad &+\;\bigl|P(\hat{Y}=1 \mid Y=0, A=a_i)\\ 
   \;&-\; P(\hat{Y}=1 \mid Y=0, A=a_j)\bigr| 
\Bigr]
\end{align*}

Each pairwise comparison is handled exactly as in the binary case by treating $a_i, a_j$ as $0,1$. Therefore, all the steps below, which are derived under a binary setup, apply pairwise to any two subgroups. Taking the maximum over these pairwise disparities then yields the multi-group measure.
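
\noindent As an illustration with hypothetical numbers (not results from this work), consider $k=3$ subgroups with true positive rates $(0.80, 0.70, 0.75)$ and false positive rates $(0.10, 0.20, 0.12)$. The pairwise disparities are
\begin{align*}
    (a_1, a_2)&:\; |0.80 - 0.70| + |0.10 - 0.20| = 0.20, \\
    (a_1, a_3)&:\; |0.80 - 0.75| + |0.10 - 0.12| = 0.07, \\
    (a_2, a_3)&:\; |0.70 - 0.75| + |0.20 - 0.12| = 0.13,
\end{align*}
so $EODD = 0.20$, determined by the pair $(a_1, a_2)$.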


\noindent This proof follows the derivation of \cite{marcinkevičs2022debiasingdeepchestxray}, adapted to EODD.

\noindent EODD measures the disparity between subgroups in true positive rate (TPR) and false positive rate (FPR). In the binary case: 
\begin{align*}
    EODD &= P_{X,Y,A} (\hat{Y} = 1 \mid Y = 1, A=1)\\ &- P_{X,Y,A} (\hat{Y} = 1 \mid Y = 1, A=0) \\
    &+ P_{X,Y,A} (\hat{Y} = 1 \mid Y = 0, A=1)\\ &- P_{X,Y,A} (\hat{Y} = 1 \mid Y = 0, A=0)
\end{align*}
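\noindent Replacing the probabilities with empirical averages of the model output $f_{\theta}(x_i)$ over samples with binary attribute $a_i$ and binary label $y_i$ gives the following proxy function, where (1) is the true positive rate gap and (2) is the false positive rate gap:
\begin{align}
    EODD &= \frac{\sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i}{\sum_{i=1}^{N} a_i y_i} \notag \\
    &- \frac{\sum_{i=1}^{N} f_{\theta}(x_i)(1-a_i)y_i}{\sum_{i=1}^{N} (1-a_i)y_i} \tag{1}\\
    &\quad + \frac{\sum_{i=1}^{N} f_{\theta}(x_i) a_i (1-y_i)}{\sum_{i=1}^{N} a_i (1-y_i)} \notag \\
    &- \frac{\sum_{i=1}^{N} f_{\theta}(x_i) (1-a_i)(1-y_i)}{\sum_{i=1}^{N} (1-a_i)(1-y_i)} \tag{2}
\end{align}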

\noindent We first define the conditional covariance:
\begin{align}
    \text{cov}&(A, X \mid Y = y) \notag \\ &= \mathbb{E}[(A - \mathbb{E}[A \mid Y = y]) (X - \mathbb{E}[X \mid Y = y]) \mid Y = y] \notag \\
    &= \mathbb{E}[AX \mid Y = y] - \mathbb{E}[A \mid Y = y] \mathbb{E}[X \mid Y = y] \tag{3}
\end{align}

\noindent The law of total covariance decomposes the unconditional covariance into a within-label term and a between-label term, which we verify below:

\begin{align}
    \text{cov}(A, X) &= \mathbb{E} \Big[ \text{cov}(A, X \mid Y) \Big] \notag \\ &+ \text{cov} \Big( \mathbb{E}[A \mid Y], \mathbb{E}[X \mid Y] \Big) \tag{4}
\end{align}

\noindent Expanding the first expectation term with (3):
\begin{align}
    \mathbb{E} [\text{cov}(A, X | Y)] &= \mathbb{E} \Big[ \mathbb{E}[AX | Y] - \mathbb{E}[A | Y] \mathbb{E}[X | Y] \Big] \notag \\
    &= \mathbb{E}[AX] - \mathbb{E}[\mathbb{E}[A | Y] \mathbb{E}[X | Y]] \tag{5}
\end{align}

\noindent Expanding the second covariance term:

\begin{align}
    \text{cov}(\mathbb{E}[A \mid Y], \mathbb{E}[X \mid Y]) &= \mathbb{E}[\mathbb{E}[A \mid Y] \mathbb{E}[X \mid Y]] \notag \\& - \mathbb{E}[A] \mathbb{E}[X] \tag{6}
\end{align}

\noindent Substituting (5) and (6) into (4):

\begin{align*}
    \text{cov}(A, X) &= \mathbb{E}[AX] - \mathbb{E}[\mathbb{E}[A \mid Y] \mathbb{E}[X \mid Y]] \\ &+ \mathbb{E}[\mathbb{E}[A \mid Y] \mathbb{E}[X \mid Y]] - \mathbb{E}[A] \mathbb{E}[X] \\
    &= \mathbb{E}[AX] - \mathbb{E}[A] \mathbb{E}[X],
\end{align*}
which is the definition of $\text{cov}(A, X)$, so the decomposition holds.

\noindent We want to show that $EODD \propto \widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=1) +
\widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=0)$.

\noindent Let $S_{AY} = \sum_{i} a_i y_i$, $S_A = \sum_{i} a_i$, and $S_Y = \sum_{i} y_i$, and let $N$ denote the number of samples.

\noindent \textbf{Expanding EODD}:

\noindent Expanding (1):

\begin{align*}
    &\frac{\sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i}{\sum_{i=1}^{N} a_i y_i}
    - \frac{\sum_{i=1}^{N} f_{\theta}(x_i) (1 - a_i) y_i}{\sum_{i=1}^{N} (1 - a_i) y_i} \\
    &= \frac{1}{S_{AY}} \sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i
    - \frac{1}{S_Y - S_{AY}} \sum_{i=1}^{N} f_{\theta}(x_i) y_i
    \\ &+ \frac{1}{S_Y - S_{AY}} \sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i  \\
    &= \frac{S_Y}{S_{AY} (S_Y - S_{AY})}
    \sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i
    \\ &- \frac{1}{S_Y - S_{AY}} \sum_{i=1}^{N} f_{\theta}(x_i) y_i
\end{align*}

\noindent Note that:
\begin{align*}
    \widehat{\text{Cov}}&(A, f_{\theta}(X) \mid Y=1)
    \\ &= \frac{\sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i}{\sum_{i=1}^{N} y_i}
    \\ &- \frac{\sum_{i=1}^{N} a_i y_i}{\sum_{i=1}^{N} y_i}
    \frac{\sum_{i=1}^{N} f_{\theta}(x_i) y_i}{\sum_{i=1}^{N} y_i}  \\
    &= \frac{1}{S_Y} \sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i
    \\ &- \frac{S_{AY}}{S_Y^2} \sum_{i=1}^{N} f_{\theta}(x_i) y_i.
\end{align*}
This shows that $(1) \propto \widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=1)$ with factor $\frac{S_Y^2}{S_{AY} (S_Y - S_{AY})}$, which is independent of $f_{\theta}$.
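
\noindent As a quick numerical check with hypothetical values (not data from this work), take four positive samples ($y_i = 1$) with attributes $a = (1, 1, 0, 0)$ and model outputs $f_{\theta}(x) = (0.9, 0.7, 0.4, 0.2)$, so that $S_Y = 4$ and $S_{AY} = 2$. Then
\begin{align*}
    \text{TPR gap} &= \tfrac{0.9 + 0.7}{2} - \tfrac{0.4 + 0.2}{2} = 0.5, \\
    \widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=1) &= \tfrac{1.6}{4} - \tfrac{2}{16} \cdot 2.2 = 0.125, \\
    \tfrac{S_Y^2}{S_{AY}(S_Y - S_{AY})} &= \tfrac{16}{2 \cdot 2} = 4,
\end{align*}
and indeed $4 \times 0.125 = 0.5$, matching the true positive rate gap.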

\noindent Expanding (2):

\begin{align*}
    &\frac{\sum_{i=1}^{N} f_{\theta}(x_i) a_i (1-y_i)}{\sum_{i=1}^{N} a_i (1-y_i)}
    \\ &- \frac{\sum_{i=1}^{N} f_{\theta}(x_i) (1-a_i)(1-y_i)}{\sum_{i=1}^{N} (1-a_i)(1-y_i)} \\
    &= \frac{N - S_Y}{(N - S_Y - S_A + S_{AY})(S_A - S_{AY})}
    \sum_{i=1}^{N} f_{\theta}(x_i) a_i  \\
    &\quad - \frac{N - S_Y}{(N - S_Y - S_A + S_{AY})(S_A - S_{AY})}
    \sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i  \\
    &\quad + \frac{1}{N - S_Y - S_A + S_{AY}}
    \sum_{i=1}^{N} f_{\theta}(x_i) y_i  \\
    &\quad - \frac{1}{N - S_Y - S_A + S_{AY}}
    \sum_{i=1}^{N} f_{\theta}(x_i)
\end{align*}

\noindent Similarly:
\begin{align*}
    \widehat{\text{Cov}}&(A, f_{\theta}(X) \mid Y = 0) \\ &=
    \frac{\sum_{i=1}^{N} f_{\theta}(x_i) a_i (1 - y_i)}{\sum_{i=1}^{N} (1 - y_i)}
    \\ &- \frac{\sum_{i=1}^{N} a_i (1 - y_i)}{\sum_{i=1}^{N} (1 - y_i)}
    \cdot \frac{\sum_{i=1}^{N} f_{\theta}(x_i) (1 - y_i)}{\sum_{i=1}^{N} (1 - y_i)}
     \\
    &= \frac{1}{N - S_Y} \sum_{i=1}^{N} f_{\theta}(x_i) a_i
    \\&- \frac{1}{N - S_Y} \sum_{i=1}^{N} f_{\theta}(x_i) a_i y_i \\
    &\quad - \frac{S_A - S_{AY}}{(N - S_Y)^2} \sum_{i=1}^{N} f_{\theta}(x_i)
    \\ &+ \frac{S_A - S_{AY}}{(N - S_Y)^2} \sum_{i=1}^{N} f_{\theta}(x_i) y_i
\end{align*}
This shows that $(2) \propto \widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=0)$ with factor $\frac{(N - S_Y)^2}{(S_A - S_{AY}) (N - S_Y - S_A + S_{AY})}$, which is independent of $f_{\theta}$.

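\noindent Therefore, EODD is a weighted sum of the two conditional covariances, with positive weights (assuming every attribute-label subgroup is non-empty) that depend only on $N$, $S_Y$, $S_A$, and $S_{AY}$ and not on $f_{\theta}$. In this sense, $EODD \propto \widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=1) +
\widehat{\text{Cov}}(A, f_{\theta}(X) \mid Y=0)$.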

\clearpage
%\subsection{Datasets}\label{app:datasets}

\begin{table*}[h]
\centering
\begin{tabular}{l|cccccc|c}
\hline
 & \textbf{AI/AN} & \textbf{Asian} & \textbf{Black} & \textbf{NH/PI} & \textbf{Other} & \textbf{White} & \textbf{Total} \\
\hline
\textbf{Female, $> 62$} & 54  & 1539  & 923  & 314  & 2518  & 6456  & 11804 \\
\textbf{Female, $\leq 62$} & 39  & 1739  & 608  & 136  & 1710  & 9500  & 13732 \\
\textbf{Male, $> 62$} & 56  & 1734  & 1023  & 240  & 3553  & 8984  & 15590 \\
\textbf{Male, $\leq 62$} & 27  & 1924  & 539  & 171  & 1853  & 11170  & 15684 \\
\hline
\textbf{Total} & 176  & 6936  & 3093  & 861  & 9634  & 36110  & 56810 \\
\hline
\end{tabular}
\caption{Patient-wise groups used for analysis based on sex, age, and race for the CheXpert dataset. The groups are unequally distributed, with very few samples for American Indian or Alaska Native (AI/AN) and Native Hawaiian or Other Pacific Islander (NH/PI).}\label{tab:chex_dataset}
\end{table*}

\begin{table*}[h]
\centering
\begin{tabular}{l|cc|c}
\hline
 & \textbf{Male} & \textbf{Female} & \textbf{Total} \\
\hline
\textbf{$\leq 58$} & 155 & 92 & 247\\
\textbf{$> 58$} & 144 & 110 & 254\\
\hline
\textbf{Total} & 299 & 202 & 501\\
\hline
\end{tabular}
\caption{Patient distribution by sex and age for the UCSF-PDGM dataset. Patients aged 58 or younger and female patients represent minority groups.}\label{tab:ucsf_dataset}
\end{table*}


%\subsection{Performance Results Before Mitigation}\label{app:performance_plots_before}

\begin{figure}[h]
    \centering
    % Optional: Add shared legend on top if needed
    \includegraphics[width=0.6\linewidth]{plots/evaluation_performance/ucsf/ucsf-evaluation_performance_legend.pdf}
    \vspace{1em}

    \includegraphics[width=0.6\linewidth]{plots/evaluation_performance/ucsf/ucsf-evaluation_performance_tgrade_psnr.pdf}
    \vspace{1em} % space between legend and first plot

    \includegraphics[width=0.6\linewidth]{plots/evaluation_performance/ucsf/ucsf-evaluation_performance_ttype_psnr.pdf}

    \caption{Tumor type and tumor grade classification performance and PSNR for different noise levels on UCSF-PDGM. The image quality and diagnostic performance axes are on a similar percentage scale. Task performance metrics show high stability across models and noise conditions, while PSNR drops with increasing noise.}
    \label{fig:performance_ucsf}
\end{figure}

\begin{table*}[t]
\centering
\footnotesize
\begin{tabular}{cll|llll}
\hline
\multicolumn{1}{l}{\textbf{Photon Count}} & \multicolumn{2}{c|}{\textbf{Metrics}}         & \textbf{Baseline} & \textbf{U-Net} & \textbf{GAN} & \textbf{SDE} \\ \hline
\multirow{15}{*}{100,000}               & \multirow{13}{*}{\textbf{AUROC}} & \textbf{Atelectasis}  & 0.87 & 0.87 & 0.86 & 0.87 \\
                                 &                        & \textbf{Cardiomegaly} & 0.91 & 0.91 & 0.91 & 0.91 \\
                                 &                        & \textbf{Consolidation} & 0.91 & 0.91 & 0.91 & 0.91 \\ 
                                 &                        & \textbf{Edema} & 0.90 & 0.90 & 0.90 & 0.90 \\ 
                                 &                        & \textbf{EC} & 0.79 & 0.78 & 0.78 & 0.79 \\ 
                                 &                        & \textbf{Fracture} & 0.76 & 0.75 & 0.75 & 0.76 \\ 
                                 &                        & \textbf{Lung Lesion} & 0.80 & 0.79 & 0.79 & 0.79 \\ 
                                 &                        & \textbf{Lung Opacity} & 0.88 & 0.88 & 0.88 & 0.88 \\ 
                                 &                        & \textbf{Pleural Effusion} & 0.93 & 0.92 & 0.92 & 0.92 \\ 
                                 &                        & \textbf{Pleural Other} & 0.83 & 0.82 & 0.81 & 0.82 \\ 
                                 &                        & \textbf{Pneumonia} & 0.83 & 0.83 & 0.83 & 0.83 \\ 
                                 &                        & \textbf{Pneumothorax} & 0.77 & 0.75 & 0.76 & 0.77 \\
                                 &                        & \textbf{Average} & 0.85 & 0.84 & 0.84 & 0.85 \\ \cline{2-3}
                                 & \multicolumn{2}{l|}{\textbf{PSNR}}            &  & 31.60 & 30.16 & 29.98 \\
                                 & \multicolumn{2}{l|}{\textbf{LPIPS}}           &  & 0.13 & 0.08 & 0.08 \\ \hline
\multirow{15}{*}{10,000}               & \multirow{13}{*}{\textbf{AUROC}} & \textbf{Atelectasis}  & 0.87 & 0.87 & 0.86 & 0.87 \\
                                 &                        & \textbf{Cardiomegaly} & 0.91 & 0.90 & 0.90 & 0.91 \\
                                 &                        & \textbf{Consolidation} & 0.91 & 0.91 & 0.90 & 0.91 \\ 
                                 &                        & \textbf{Edema} & 0.90 & 0.89 & 0.89 & 0.90 \\ 
                                 &                        & \textbf{EC} & 0.79 & 0.78 & 0.78 & 0.78 \\ 
                                 &                        & \textbf{Fracture} & 0.76 & 0.75 & 0.74 & 0.75 \\ 
                                 &                        & \textbf{Lung Lesion} & 0.80 & 0.78 & 0.78 & 0.79 \\ 
                                 &                        & \textbf{Lung Opacity} & 0.88 & 0.88 & 0.87 & 0.88 \\ 
                                 &                        & \textbf{Pleural Effusion} & 0.93 & 0.92 & 0.91 & 0.92 \\ 
                                 &                        & \textbf{Pleural Other} & 0.83 & 0.81 & 0.80 & 0.82 \\ 
                                 &                        & \textbf{Pneumonia} & 0.83 & 0.82 & 0.82 & 0.82 \\ 
                                 &                        & \textbf{Pneumothorax} & 0.77 & 0.75 & 0.75 & 0.77 \\
                                 &                        & \textbf{Average} & 0.85 & 0.84 & 0.83 & 0.84 \\ \cline{2-3}
                                 & \multicolumn{2}{l|}{\textbf{PSNR}}            &  & 30.52 & 28.62 & 27.12 \\
                                 & \multicolumn{2}{l|}{\textbf{LPIPS}}           &  & 0.19 & 0.11 & 0.15 \\ \hline
\multirow{15}{*}{3,000}               & \multirow{13}{*}{\textbf{AUROC}} & \textbf{Atelectasis}  & 0.87 & 0.86 & 0.85 & 0.86 \\
                                 &                        & \textbf{Cardiomegaly} & 0.91 & 0.90 & 0.90 & 0.91 \\
                                 &                        & \textbf{Consolidation} & 0.91 & 0.91 & 0.90 & 0.90 \\ 
                                 &                        & \textbf{Edema} & 0.90 & 0.89 & 0.89 & 0.89 \\ 
                                 &                        & \textbf{EC} & 0.79 & 0.78 & 0.78 & 0.78 \\ 
                                 &                        & \textbf{Fracture} & 0.76 & 0.74 & 0.73 & 0.75 \\ 
                                 &                        & \textbf{Lung Lesion} & 0.80 & 0.77 & 0.77 & 0.78 \\ 
                                 &                        & \textbf{Lung Opacity} & 0.88 & 0.87 & 0.87 & 0.87 \\ 
                                 &                        & \textbf{Pleural Effusion} & 0.93 & 0.91 & 0.91 & 0.92 \\ 
                                 &                        & \textbf{Pleural Other} & 0.83 & 0.80 & 0.78 & 0.81 \\ 
                                 &                        & \textbf{Pneumonia} & 0.83 & 0.82 & 0.80 & 0.82 \\ 
                                 &                        & \textbf{Pneumothorax} & 0.77 & 0.74 & 0.74 & 0.77 \\
                                 &                        & \textbf{Average} & 0.85 & 0.83 & 0.83 & 0.84 \\ \cline{2-3}
                                 & \multicolumn{2}{l|}{\textbf{PSNR}}            &  & 28.89 & 27.36 & 26.83 \\
                                 & \multicolumn{2}{l|}{\textbf{LPIPS}}           &  & 0.22 & 0.14 & 0.15 \\ \hline
\end{tabular}
\vspace{-10pt}
\caption{CheXpert performance across reconstruction models and photon counts.
Pathologies with lower baseline AUROC (e.g., fracture, pneumothorax, lung lesion) experience greater performance drops under noise than more easily detectable conditions (e.g., effusion, cardiomegaly). Baseline corresponds to original images. EC: enlarged cardiomediastinum.}\label{tab:chex_perf}
\end{table*}


\begin{table*}[t]
    \centering
    \begin{tabular}{cll|cccc}
    \hline
    \multicolumn{1}{l}{\textbf{Acceleration}} & \multicolumn{2}{c|}{\textbf{Metrics}}         & \textbf{Baseline} & \textbf{U-Net} & \textbf{GAN} & \textbf{SDE} \\ \hline
    \multirow{5}{*}{4}               & \multirow{2}{*}{\textbf{AUROC}} & \textbf{Tumor Type}  &    0.79      &   0.79   &  0.77   &     0.78     \\
                                     &                        & \textbf{Tumor Grade} &   0.73       &   0.76   &  0.71   &     0.72     \\ \cline{2-3}
                                     & \multicolumn{2}{l|}{\textbf{Dice}}            &  0.72        &   0.72   &  0.71   &     0.72     \\ \cline{2-3}
                                     & \multicolumn{2}{l|}{\textbf{PSNR}}            &         &   42.94   &  37.71   &     40.23     \\
                                     & \multicolumn{2}{l|}{\textbf{LPIPS}}           &          &   0.01   &  0.02   &     0.00     \\ \hline
    \multirow{5}{*}{8}               & \multirow{2}{*}{\textbf{AUROC}} & \textbf{Tumor Type}  &   0.79       &   0.77   &  0.83   &     0.79     \\
                                     &                        & \textbf{Tumor Grade} &   0.73       &   0.75   &  0.78   &     0.73     \\ \cline{2-3}
                                     & \multicolumn{2}{l|}{\textbf{Dice}}            &  0.72        &   0.70   &  0.71   &     0.71     \\ \cline{2-3}
                                     & \multicolumn{2}{l|}{\textbf{PSNR}}            &         &   35.77   &  35.20   &     34.65     \\
                                     & \multicolumn{2}{l|}{\textbf{LPIPS}}           &          &   0.03   &  0.02   &     0.02     \\ \hline
    \multirow{5}{*}{16}               & \multirow{2}{*}{\textbf{AUROC}} & \textbf{Tumor Type}  &   0.79       &   0.76   &  0.81   &     0.80     \\
                                     &                        & \textbf{Tumor Grade} &   0.73       &   0.70   &  0.74   &     0.69     \\ \cline{2-3}
                                     & \multicolumn{2}{l|}{\textbf{Dice}}            &  0.72        &   0.67   &  0.70   &     0.71     \\ \cline{2-3}
                                     & \multicolumn{2}{l|}{\textbf{PSNR}}            &         &   31.84   &  32.34   &     34.56     \\
                                     & \multicolumn{2}{l|}{\textbf{LPIPS}}           &          &   0.06   &  0.04   &     0.02     \\ \hline
    \end{tabular}
    \caption{Performance metrics for UCSF-PDGM across reconstruction models and noise levels. While PSNR varies with noise and model, downstream segmentation and classification metrics remain relatively stable, indicating robust task performance across conditions.}\label{tab:ucsf_perf}
    \end{table*}


% --- BASELINE FAIRNESS




% --- LAMBDA VALUES

%\subsection{Lambda Values}\label{app:lambda_values}
\begin{figure*}[t]
\centering

% Legend
\subfigure[]{%
\begin{minipage}[t]{\textwidth}
\centering
\includegraphics[width=0.3\textwidth]{plots/lambda_values/legend_wo_psnr.pdf}
\end{minipage}}

\begin{tabular}{c@{\hspace{0.2cm}}c@{\hspace{0.2cm}}c}

\subfigure[Age]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/lambda_values/eodd_lambda/eodd_lambda_age_auroc_fairness.pdf}
\end{minipage}}
&
\subfigure[Sex]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/lambda_values/eodd_lambda/eodd_lambda_gender_auroc_fairness.pdf}
\end{minipage}}
&
\subfigure[Race]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/lambda_values/eodd_lambda/eodd_lambda_ethnicity_auroc_fairness.pdf}
\end{minipage}}

\end{tabular}

    \caption{Influence of the fairness weighting parameter ($\lambda_{\mathrm{fair}}$) on classifier AUROC performance and fairness metrics for the Equalized Odds (EODD) mitigation constraint, evaluated with U-Net on the CheXpert dataset. AUROC is only mildly sensitive to $\lambda_{\mathrm{fair}}$; fairness metrics vary more but show no substantial improvement as $\lambda_{\mathrm{fair}}$ increases.}
    \label{fig:eodd-lambda-auroc}
    \end{figure*}
    
    
    
    
    \begin{figure*}[t]
  \centering

\subfigure[]{%
\begin{minipage}[t]{\textwidth}
\centering
\includegraphics[width=0.3\textwidth]{plots/lambda_values/legend_wo_auroc.pdf}
\end{minipage}}

\begin{tabular}{c@{\hspace{0.2cm}}c@{\hspace{0.2cm}}c}

\subfigure[Age]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/lambda_values/eodd_lambda/eodd_lambda_age_psnr_fairness.pdf}
\end{minipage}}
&
\subfigure[Sex]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/lambda_values/eodd_lambda/eodd_lambda_gender_psnr_fairness.pdf}
\end{minipage}}
&
\subfigure[Race]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/lambda_values/eodd_lambda/eodd_lambda_ethnicity_psnr_fairness.pdf}
\end{minipage}}

\end{tabular}
    \caption{Impact of $\lambda_{\mathrm{fair}}$ on reconstruction quality (PSNR) compared to fairness for the EODD constraint mitigation. PSNR remains stable across values of $\lambda_{\mathrm{fair}}$, while fairness shows slight variation without substantial improvement.}
    \label{fig:eodd-lambda-psnr}
    \end{figure*}
    
    



% --- Performance Plots

%\subsection{Performance Results After Mitigation}
\begin{figure*}[t]
\centering

% Shared legend
\includegraphics[width=0.5\linewidth]{plots/performance_change/performance_change_legend.pdf}
\vspace{0.5em}

% --- Reweighted ---
\subfigure[Reweighting]{%
\begin{minipage}[t]{\textwidth}
\centering
\includegraphics[width=0.45\linewidth]{plots/performance_change/performance_change_reweighted_ucsf.pdf}
\hspace{1em}
\includegraphics[width=0.45\linewidth]{plots/performance_change/performance_change_reweighted_chex.pdf}
\end{minipage}}
\vspace{1em}

% --- EODD ---
\subfigure[Equalized odds constraint]{%
\begin{minipage}[t]{\textwidth}
\centering
\includegraphics[width=0.45\linewidth]{plots/performance_change/performance_change_eodd_ucsf.pdf}
\hspace{1em}
\includegraphics[width=0.45\linewidth]{plots/performance_change/performance_change_eodd_chex.pdf}
\end{minipage}}
\vspace{1em}



    \caption{Change in prediction performance after applying bias mitigation techniques. Each row compares two datasets for a given method: (a) Reweighted sampling, (b) Equalized odds constraint. UCSF-PDGM experiences more performance degradation. However, all techniques show good stability in task performance, with few outliers in the UCSF-PDGM dataset.}
    \label{app:mitigation_performance}
\end{figure*}

\subsection{Additional Fairness Results}\label{app:fairness_results}
In addition to Equalized Odds and Skewed Error Ratio in the main text, we investigate two additional bias metrics: 

\paragraph{Equality of Opportunity (EOP):}
EOP requires equal true positive rates across groups:
\begin{align*}
P&(\hat{Y} = 1 \mid Y = 1, A = 0) \\ &= P(\hat{Y} = 1 \mid Y = 1, A = 1).
\end{align*}

We report the worst-case Equality of Opportunity \cite{DBLP:journals/corr/HardtPS16} difference between groups:
\begin{align*}
\max_{i,j}|P&(\hat{Y} = 1 \mid Y = 1, A = i) \\&- P(\hat{Y} = 1 \mid Y = 1, A = j)|, \\& \quad
\forall \quad A \in \mathcal{A}.
\end{align*}

EOP is a relaxation of EODD, requiring parity only for the positive class (\(Y=1\)).

\paragraph{\(\Delta \text{Dice}\):}
Given the limited availability of dedicated segmentation fairness metrics, we also compute:
\[
\Delta \text{Dice} = \max_{i, j} \left|\text{Dice}_{A_i} - \text{Dice}_{A_j}\right|, A \in \mathcal{A}
\]
which represents the maximum difference in Dice across all protected subgroups \(\mathcal{A}\).

Plots containing the results of these additional evaluations can be found in Figures \ref{fig:bias_chexpert_eop} and \ref{fig:bias_ucsf_class_eop}.

Additionally, Figures \ref{fig:histogram_race} and \ref{fig:bias_chexpert_race} contain results using different race subgroups for CheXpert. Our original evaluations considered each of the original subgroups listed within the dataset (Table \ref{tab:chex_dataset}) when computing the fairness metrics. Because the small counts for the American Indian or Alaska Native and Native Hawaiian or Other Pacific Islander subgroups lead to large error bars, we also computed these metrics after merging these subgroups into the Other subgroup.

% Fairness plots - CheXpert only (EOP)
\begin{figure*}[t]
\centering

% Legend only (no caption)
\includegraphics[width=\textwidth]{plots/fairness/eop/evaluation_midl_camera_fairness_legend.pdf}

\begin{tabular}{c@{\hspace{0.2cm}}c}

\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_atelectasis.pdf}
\end{minipage}} 
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_cardiomegaly.pdf}
\end{minipage}}
\\[0.3cm]

\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_consolidation.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_edema.pdf}
\end{minipage}}
\\[0.3cm]

\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_ec.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_fracture.pdf}
\end{minipage}}
\\[0.3cm]

\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_lung-lesion.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_lung-opacity.pdf}
\end{minipage}}
\\[0.3cm]

\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_pleural-effusion.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_pleural-other.pdf}
\end{minipage}}
\\[0.3cm]

\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_pneumonia.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_pneumothorax.pdf}
\end{minipage}}

\end{tabular}

\caption{Equality of opportunity (EOP) bias change pre- and post-mitigation compared to predictions on original images for CheXpert classification. Pre-mitigation, bias tends to increase slightly for sex; race exhibits high variance. Bias tends to decline slightly post-mitigation.}
\label{fig:bias_chexpert_eop}

\end{figure*}

% Fairness plots - UCSF-PDGM only (EOP)
\begin{figure*}[t]
\centering

\includegraphics[width=\textwidth]{plots/fairness/eop/evaluation_midl_camera_fairness_legend.pdf}

\begin{tabular}{ccc}

\subfigure[]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_dice.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_tgrade.pdf}
\end{minipage}}
&
\subfigure[]{%
\begin{minipage}[t]{0.3\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eop/evaluation_midl_camera_fairness_ttype.pdf}
\end{minipage}}

\end{tabular}

\caption{Equality of opportunity (EOP) and $\Delta$ Dice bias change pre- and post-mitigation compared to predictions on original images for UCSF-PDGM classification and segmentation.}
\label{fig:bias_ucsf_class_eop}

\end{figure*}

%\subsection{Merging Minority Race Groups}
%\label{app:alt_race_grouping}
 \begin{figure}[t!]
 \centering
 \includegraphics[width=0.7\linewidth]{plots/histogram_modified_race/bias_change_histogram.pdf}
 \vspace{-10pt}
 \caption{Distribution of bias changes for CheXpert when using alternative race subgroups.}
    \label{fig:histogram_race}
\end{figure}


% Fairness plots - CheXpert only
\begin{figure*}[!t]
\centering

% Legend only (no caption)
\includegraphics[width=\textwidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_legend.pdf}

\begin{tabular}{c@{\hspace{0.2cm}}c}

% Row 1
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_atelectasis.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_cardiomegaly.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 2
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_consolidation.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_edema.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 3
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_ec.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_fracture.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 4
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_lung-lesion.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_lung-opacity.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 5
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_pleural-effusion.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_pleural-other.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 6
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_pneumonia.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness_modified_race/eodd/evaluation_midl_camera_race_remap_fairness_pneumothorax.pdf}
\end{minipage}
}

\end{tabular}

\caption{Equalized odds bias change pre- and post-mitigation compared to predictions on original images for CheXpert when using alternative race subgroups.}
\label{fig:bias_chexpert_race}

\end{figure*}

\begin{figure}[ht]
    \centering

    % Legend
    \includegraphics[width=0.95\linewidth]{plots/rebuttal/performance/ucsf-evaluation_performance_legend_ssim.pdf}
    \vspace{1em}

    % Two subplots side by side
    \begin{tabular}{c@{\hspace{0.5cm}}c}
        \subfigure[]{\includegraphics[width=0.49\linewidth]{plots/rebuttal/performance/chex-evaluation_performance_average_ssim.pdf}} &
        \subfigure[]{\includegraphics[width=0.49\linewidth]{plots/rebuttal/performance/ucsf-evaluation_performance_dice_ssim.pdf}}
    \end{tabular}
    \vspace{-10pt}
    \caption{Downstream performance and SSIM at varying noise levels for (a) CheXpert classification (average) and (b) UCSF-PDGM segmentation (Dice). Axes for SSIM and task performance are scaled to comparable percentage ranges. Baseline indicates performance on original images.}
    \label{fig:performance_ssim1}
\end{figure}

\begin{figure}[ht]
    \centering

    % Legend
    \includegraphics[width=0.95\linewidth]{plots/rebuttal/performance/ucsf-evaluation_performance_legend_ssim.pdf}
    \vspace{1em}

    % Two subplots side by side
    \begin{tabular}{c@{\hspace{0.5cm}}c}
        \subfigure[]{\includegraphics[width=0.49\linewidth]{plots/rebuttal/performance/ucsf-evaluation_performance_ttype_ssim.pdf}} &
        \subfigure[]{\includegraphics[width=0.49\linewidth]{plots/rebuttal/performance/ucsf-evaluation_performance_tgrade_ssim.pdf}}
    \end{tabular}
    \vspace{-10pt}
    \caption{Downstream performance and SSIM at varying noise levels for the UCSF-PDGM classification tasks: (a) tumor type and (b) tumor grade. Axes for SSIM and task performance are scaled to comparable percentage ranges. Baseline indicates performance on original images.}
    \label{fig:performance_ssim2}
\end{figure}


\begin{figure}[ht]
    \centering

    % % Legend
    % \includegraphics[width=0.8\linewidth]{plots/rebuttal/performance/ucsf-evaluation_performance_legend_ssim.pdf}
    % \vspace{1em}

    % Three subplots stacked vertically
    \begin{tabular}{c}
        \subfigure[]{\includegraphics[width=0.5\linewidth]{plots/rebuttal/histogram/bias_change_histogram_gender.pdf}} \\[0.5cm]
        \subfigure[]{\includegraphics[width=0.5\linewidth]{plots/rebuttal/histogram/bias_change_histogram_age.pdf}} \\[0.5cm]
        \subfigure[]{\includegraphics[width=0.5\linewidth]{plots/rebuttal/histogram/bias_change_histogram_ethnicity.pdf}}
    \end{tabular}

    \vspace{-10pt}
    \caption{Distribution of bias changes separated by sensitive attribute: (a) sex, (b) age, and (c) race.}
    \label{fig:alt_bias_histogram}
\end{figure}