\section{Results}
We first evaluate the impact of reconstruction on downstream task performance before analyzing fairness and the effectiveness of mitigation techniques.

\begin{figure}[ht]
    \centering

    % Legend
    \includegraphics[width=0.95\linewidth]{plots/evaluation_performance/ucsf/ucsf-evaluation_performance_legend.pdf}
    \vspace{1em}

    % Two subplots side by side
    \begin{tabular}{c@{\hspace{0.5cm}}c}
        \subfigure[]{\includegraphics[width=0.49\linewidth]{plots/evaluation_performance/chex/chex-evaluation_performance_average_psnr.pdf}} &
        \subfigure[]{\includegraphics[width=0.49\linewidth]{plots/evaluation_performance/ucsf/ucsf-evaluation_performance_dice_psnr.pdf}}
    \end{tabular}
    \vspace{-10pt}
    \caption{Downstream performance and PSNR at varying noise levels. Axes for PSNR and task performance are scaled to comparable percentage ranges. Although PSNR declines as noise increases, task performance remains stable. Baseline indicates performance on original images.}
    \label{fig:performance}
\end{figure}

\subsection{Impact of Reconstruction on Task Performance}

Figure~\ref{fig:performance} summarizes downstream performance as a function of reconstruction noise. We report segmentation Dice for UCSF-PDGM and the mean AUROC across the 12 CheXpert pathologies. For clarity, the y-axes for PSNR and the task metrics are normalized to the same percentage range. Across all experiments, diagnostic performance remains largely unchanged even though PSNR decreases substantially with increasing noise. Specifically, the Dice score for UCSF-PDGM segmentation varies by only~$\sim\!3\,\%$ across noise conditions, and the mean CheXpert AUROC fluctuates by only~$1\,\%$. In contrast, PSNR decreases by over $10$\,dB ($26\,\%$) for UCSF-PDGM and by~$\sim\!3$\,dB ($9\,\%$) for CheXpert. Analogous results for UCSF-PDGM classification are presented in the Appendix (Figure~\ref{fig:performance_ucsf}), where the same pattern (substantial PSNR loss but minimal impact on task performance) holds for all three reconstruction models. Using SSIM instead of PSNR as the reconstruction metric shows similar trends (Appendix Figure~\ref{fig:performance_ssim1}).
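
As a reading aid, the percentage figures above can be related to the decibel values under the assumption (not stated explicitly in this section) that relative changes are measured against the PSNR at the lowest noise level, written here with the illustrative symbol $\mathrm{PSNR}_{\mathrm{ref}}$:
\[
  \Delta_{\%} = 100 \cdot \frac{\mathrm{PSNR}_{\mathrm{ref}} - \mathrm{PSNR}_{\mathrm{noisy}}}{\mathrm{PSNR}_{\mathrm{ref}}} .
\]
Under this assumption, a drop of roughly $10$\,dB that corresponds to $26\,\%$ implies $\mathrm{PSNR}_{\mathrm{ref}} \approx 10/0.26 \approx 38$\,dB for UCSF-PDGM, and a drop of roughly $3$\,dB that corresponds to $9\,\%$ implies $\mathrm{PSNR}_{\mathrm{ref}} \approx 3/0.09 \approx 33$\,dB for CheXpert. These reference values are back-of-the-envelope estimates derived from rounded numbers, not reported measurements.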

A closer look at CheXpert reveals a mild dependence on baseline task difficulty: pathologies with lower initial AUROC show slightly larger declines. For example, with U-Net reconstruction, the AUROC for consolidation remains stable at~0.91, whereas the AUROC for lung lesion drops from~0.79 to~0.77 as noise increases (see Appendix Table~\ref{tab:chex_perf} for more details).

\subsection{Impact of Reconstruction on Fairness}
Although aggregate task performance is largely unaffected, reconstruction could still alter relative performance across demographic subgroups. To test this possibility, we evaluated the fairness of the downstream models at acceleration factor 8 for UCSF-PDGM and at a photon count of 10,000 for CheXpert, which represent the middle noise level for each dataset.

 \begin{figure}[t!]
 \centering
 \includegraphics[width=0.7\linewidth]{plots/histogram/bias_change_histogram.pdf}
 \vspace{-10pt}
 \caption{Distribution of bias changes (percent change relative to original images) across all reconstruction models, datasets, and tasks, stratified by sensitive attribute. Vertical lines mark the medians. Most shifts cluster near zero, but sex shows a broader positive tail. Separate plots for each sensitive attribute are shown in Fig.~\ref{fig:alt_bias_histogram}.}
    \label{fig:histogram}
\end{figure}

% Fairness plots - CheXpert only
\begin{figure*}[!t]
\centering

% Legend only (no caption)
\includegraphics[width=\textwidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_legend.pdf}

\begin{tabular}{c@{\hspace{0.2cm}}c}

% Row 1
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_atelectasis.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_cardiomegaly.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 2
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_consolidation.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_edema.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 3
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_ec.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_fracture.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 4
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_lung-lesion.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_lung-opacity.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 5
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_pleural-effusion.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_pleural-other.pdf}
\end{minipage}
}
\\[0.3cm]

% Row 6
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_pneumonia.pdf}
\end{minipage}
}
&
\subfigure[]{%
\begin{minipage}[t]{0.49\textwidth}
\centering
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_pneumothorax.pdf}
\end{minipage}
}

\end{tabular}

\caption{Equalized odds bias change pre- and post-mitigation compared to predictions on original images for CheXpert. Pre-mitigation (``Reconstruction''), bias tends to increase slightly for sex; race exhibits high variance. Bias tends to decline slightly post-mitigation. Error bars represent standard deviation.}
\label{fig:bias_chexpert}

\end{figure*}

\begin{table}[ht]
\centering
  \begin{tabular}{l|ccc}
  \hline
                          & \multicolumn{1}{c|}{\textbf{Sex}} & \multicolumn{1}{c|}{\textbf{Age}} & {\textbf{Race}} \\ \hline
  \textbf{Classification} & $0.05 \pm 0.07$ & $0.17 \pm 0.08$ & $0.07 \pm 0.03$ \\ \hline
  \textbf{Segmentation}   & $1.10$ & $1.22$ & -- \\
  \hline
  \end{tabular}
  \caption{Baseline fairness of the classifiers (EODD) and the segmentation model (SER) for different sensitive attributes. For classification, mean and s.d. are reported across all classification tasks. Segmentation corresponds to UCSF-PDGM performance. Sex exhibits the lowest baseline bias.}
  \label{tab:baseline_bias}
\end{table}

Fig.~\ref{fig:histogram} displays the distribution of bias shifts when reconstructed images replace the original inputs. To provide a global overview, the histogram pools the bias shifts across all tasks, pathologies, and reconstruction models. Because the diagnostic models already exhibit bias on the original inputs (Table~\ref{tab:baseline_bias}), the shifts are plotted as percentage changes relative to the original bias to highlight relative effects. The distribution peaks near zero, indicating little bias change in most instances. However, there is a noticeable tail towards positive bias changes, especially for sex, which exhibits a median increase of $24\,\%$. This is partly attributable to sex having a lower baseline bias than age and race (Table~\ref{tab:baseline_bias}), so that a given absolute shift translates into a larger relative change.
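
To make the role of the baseline explicit, the relative bias change can be sketched as follows, where $b_{\mathrm{orig}}$ and $b_{\mathrm{rec}}$ are illustrative symbols (not notation from the main text) for the bias metric on original and reconstructed images:
\[
  \Delta b_{\%} = 100 \cdot \frac{b_{\mathrm{rec}} - b_{\mathrm{orig}}}{b_{\mathrm{orig}}} .
\]
For a hypothetical absolute EODD shift of $0.012$, chosen purely for illustration, the mean classification baselines in Table~\ref{tab:baseline_bias} give $100 \cdot 0.012/0.05 = 24\,\%$ for sex but only $100 \cdot 0.012/0.17 \approx 7\,\%$ for age. The same absolute shift therefore appears more than three times larger for the attribute with the lower baseline.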

The bias changes for each pathology and model are provided in Figures~\ref{fig:bias_chexpert} and~\ref{fig:bias_ucsf_class} (represented by the ``Reconstruction'' value in each plot).
For UCSF-PDGM, no significant fairness deviations were observed when using the reconstructed images compared to the original images for either the segmentation or classification tasks (Figure~\ref{fig:bias_ucsf_class}). CheXpert shows more frequent significant shifts. Of the 36 combinations (12 pathologies $\times$ 3 reconstruction models), there were 8 significant changes for sex (all in the positive direction) and 12 significant changes for age (4 in the positive direction). Due to large error bars, there were no significant changes for race, but an alternative analysis that excluded subgroups with small sample sizes did reveal some significant changes (Appendix~\ref{app:fairness_results}). Overall, the pathology-level findings support the histogram trend of a slight bias increase for sex and a slight decrease for age. The absolute magnitude of the effects was generally modest; however, some are on the order of a 0.05 change in EODD, corresponding to a 5\% difference in sensitivity/specificity, which is meaningful at the population level. Across reconstruction methods, the GAN- and SDE-based models exhibited smaller bias shifts than the traditional U-Net (Table~\ref{tab:model_bias}).
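
The link between an EODD change and sensitivity/specificity can be illustrated with one common formulation of the equalized odds difference for two subgroups $a$ and $b$. This formulation is an assumption made for illustration and may differ in detail from the definition given in the Methods:
\[
  \mathrm{EODD} = \frac{1}{2}\Bigl( |\mathrm{TPR}_a - \mathrm{TPR}_b| + |\mathrm{FPR}_a - \mathrm{FPR}_b| \Bigr).
\]
Under this formulation, an EODD increase of $0.05$ corresponds to the subgroup gaps in sensitivity (TPR) and specificity ($1 - \mathrm{FPR}$) widening by $5$ percentage points on average. As a hypothetical example, for a subgroup with $10{,}000$ positive cases, a $5$-point sensitivity gap would amount to $0.05 \cdot 10{,}000 = 500$ additional missed findings relative to the other subgroup.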

\begin{table}[ht]
\centering
  \begin{tabular}{l|ccc}
  \hline
                          & \multicolumn{1}{c|}{\textbf{U-Net}} & \multicolumn{1}{c|}{\textbf{GAN}} & {\textbf{SDE}} \\ \hline
  \textbf{Median}          & 2.28 & -0.21 & 1.59 \\ \hline
  \textbf{Absolute Median}            & 14.6 & 11.8 & 11.5 \\
  \hline
  \end{tabular}
    \caption{Median bias change (\% change in EODD/SER) for each reconstruction approach, aggregated across all datasets, tasks, and attributes. SDE and GAN show a slightly smaller bias shift than U-Net.}  \label{tab:model_bias}
\end{table}


\subsection{Bias Mitigation}
While the impact of reconstruction on fairness was generally modest, applying mitigation strategies at the reconstruction stage could still reduce these effects or even improve the fairness of the underlying diagnostic models. We therefore tested two mitigation techniques inspired by classification literature but applied exclusively during reconstruction model training: sample reweighting and an equalized odds (EODD) constraint.

\begin{table}[ht]
\centering
  \begin{tabular}{l|ccc}
  \hline
                          & \multicolumn{1}{c|}{\textbf{Sex}} & \multicolumn{1}{c|}{\textbf{Age}} & {\textbf{Race}} \\ \hline
  \textbf{Standard}          & 24.1 & -1.88 & 3.30 \\ \hline
  \textbf{Reweighted}        & 10.6 & 0.03 & 1.05 \\ \hline
  \textbf{EODD}              & 7.56 & -2.01 & 0.52 \\ 
  \hline
  \end{tabular}
    \caption{Median bias change (\% change in EODD/SER) by mitigation strategy across all datasets, tasks, and models. Standard corresponds to the original results without mitigation applied.}  \label{tab:mitigation_bias}
\end{table}

Table~\ref{tab:mitigation_bias} summarizes the bias changes for the mitigated models compared to the standard models. The summary aggregates over pathologies and reconstruction model types; results for each combination are presented in Figures~\ref{fig:bias_chexpert} and~\ref{fig:bias_ucsf_class}.
Sex-related biases see the most substantial percentage improvements for both mitigation strategies. The EODD mitigation approach exhibited a slightly lower median bias change for each sensitive attribute than sample reweighting (Table~\ref{tab:mitigation_bias}). For CheXpert, the improved fairness for sex was most notable for the U-Net and SDE models, and less so for the GAN-based Pix2Pix model (Figure~\ref{fig:bias_chexpert}). For UCSF-PDGM segmentation, EODD reduced bias for most attributes and models, most strongly for U-Net (Figure~\ref{fig:bias_ucsf_class}). Classification fairness on UCSF-PDGM exhibits no consistent pattern, with fluctuations in both directions. Overall, while some fairness improvements are observed, the magnitudes are modest compared to the original bias (e.g., a 16.5 percentage-point reduction in the median bias change for sex with EODD mitigation) and can depend on the pathology and sensitive attribute.
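
The value of 16.5 quoted above for sex can be traced to the medians in Table~\ref{tab:mitigation_bias}, read as a difference in percentage points between the standard and mitigated models. The analogous value for reweighting is obtained in the same way:
\[
  24.1 - 7.56 = 16.54 \approx 16.5 \quad \mbox{(EODD)}, \qquad 24.1 - 10.6 = 13.5 \quad \mbox{(reweighted)}.
\]
This reading assumes that the summary compares medians across configurations, so the values describe a shift of the median rather than a per-configuration improvement.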

\begin{table}[ht]
\centering
  \setlength{\tabcolsep}{1mm}   % tighter column spacing
  \begin{tabular}{l|cc|cc} %|cc}
    \hline
            & \multicolumn{2}{c|}{\textbf{Reweighted}} & \multicolumn{2}{c}{\textbf{EODD}}  \\ \cline{2-5}
            & \textbf{Chex} & \textbf{UCSF} & \textbf{Chex} & \textbf{UCSF}  \\ \hline
    \textbf{PSNR}  & 0.54 & -0.75 & -0.64 & -7.28  \\ \hline
    \textbf{Down.}   & 0.07 & -1.97 &  0.02 & -2.94  \\ \hline
  \end{tabular}
  \caption{Mean change (\%) in PSNR and downstream performance (``Down.''; AUROC/Dice) per dataset after each mitigation, averaged over reconstruction models and tasks. Performance drops are modest, except for PSNR in UCSF-PDGM for EODD.}
  \label{tab:fairness_perf}
\end{table}

Fairness gains can incur performance trade-offs, but the trade-offs observed here are modest. Table~\ref{tab:fairness_perf} reports the mean change in PSNR and downstream task performance across reconstruction models when the mitigation strategies are applied. CheXpert deviations are below \(1\,\%\) for both PSNR and downstream AUROC. Downstream performance in UCSF-PDGM is also only moderately affected by the mitigation strategies (decreases of \(2\) to \(3\,\%\)), though PSNR drops by \(7.28\,\%\) with EODD (see Figure~\ref{app:mitigation_performance} in the Appendix). Reweighting incurs the smallest penalties overall.

Additional results using EOP and \(\Delta\)Dice fairness metrics before and after mitigation are provided in the Appendix (Figures~\ref{fig:bias_chexpert_eop} and~\ref{fig:bias_ucsf_class_eop}) and support the trends described above.