\section{Discussion}

We developed and applied an analysis framework that integrates reconstruction and prediction models to evaluate the effects of image reconstruction on downstream clinical performance and fairness, and to investigate bias mitigation strategies at the reconstruction stage. Our analysis revealed several insights, summarized below.

\paragraph{Stability of Downstream Performance:} Despite notable reductions in image quality, indicated by decreased PSNR at higher noise levels, downstream segmentation and classification performance remained robust to image reconstruction. This stability suggests that current diagnostic models are largely resilient to reconstruction-induced image degradations, at least for the studied tasks, which implies that minor reconstruction noise might not adversely impact clinical diagnostic accuracy. This finding may be surprising given that deep learning classification models are often thought to lack robustness, for example when the data are heterogeneous or noisy \cite{biomedinformatics4020050}. It suggests a nuanced interpretation of robustness, where models may be robust to certain transformations (e.g., reconstruction noise) but not others. Critically, these findings also have implications for the studied reconstruction models. Even with robust downstream models, we would expect downstream performance to drop if the reconstruction models removed the underlying information necessary to perform the tasks. Instead, we observe largely stable downstream performance even as PSNR decreases, suggesting that diagnostic features are retained for the studied tasks. Nonetheless, there was a mild dependence on task difficulty, where more subtle pathologies (e.g., lung lesions) showed larger performance effects, highlighting future opportunities to apply our framework to other subtle tasks where the implications of AI-based reconstruction are currently unknown.

\FloatBarrier

% Fairness plots - UCSF-PDGM only
\begin{figure*}[!t]
\centering

\includegraphics[width=\textwidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_legend.pdf}

\vspace{-10pt}

\begin{minipage}[t]{0.32\textwidth}
\centering
\subfigure[]{%
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_dice.pdf}
}
\end{minipage}\hfill
\begin{minipage}[t]{0.32\textwidth}
\centering
\subfigure[]{%
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_tgrade.pdf}
}
\end{minipage}\hfill
\begin{minipage}[t]{0.32\textwidth}
\centering
\subfigure[]{%
\includegraphics[width=\linewidth]{plots/fairness/eodd/evaluation_midl_camera_fairness_ttype.pdf}
}
\end{minipage}
\vspace{-10pt}
\caption{EODD and SER bias change pre- and post-mitigation compared to predictions on original images for UCSF-PDGM tasks. No consistent trends emerge for the classification tasks. Error bars represent standard deviation.}
\label{fig:bias_ucsf_class}

\end{figure*}

\paragraph{Fairness Implications and Variability:} The aggregate effect of reconstruction on fairness was relatively modest, though certain pathologies and sensitive attributes showed significant shifts. These shifts varied in magnitude and direction, with a tendency toward increased bias, especially for patient sex. In most cases, the magnitude represented only a small fraction of the bias already present in the diagnostic models, though some would correspond to a \textasciitilde5\% difference in sensitivity/specificity between subgroups. Thus, reconstruction can contribute to bias in downstream tasks, but the overall bias appears to be largely driven by the downstream models themselves. 

\paragraph{Effectiveness of Mitigation Techniques:} Mitigation strategies reduced sex-related biases on CheXpert without measurable performance trade-offs in AUROC or PSNR (Figure~\ref{app:mitigation_performance} in the Appendix). However, similar mitigation strategies yielded inconsistent results on UCSF-PDGM, suggesting that their effectiveness depends on the dataset and on the complexity of the underlying task.

\paragraph{Sensitivity to Model Choice:}
The SDE and GAN-based reconstruction approaches introduced less additional bias overall than the standard U-Net, which may be counterintuitive given the generative nature of the SDE and GAN models. The U-Net also exhibited larger degradations in downstream performance when fine-tuned with the fairness mitigation strategies (Figure~\ref{app:mitigation_performance} in the Appendix). This sensitivity may arise from its lower capacity relative to the other methods, which limits simultaneous optimization of image fidelity and fairness constraints.

\paragraph{Summary of Clinical Implications:}
The robustness of downstream performance to AI-based image reconstruction is encouraging, particularly as these technologies are increasingly integrated into clinical practice. However, some performance drops were observed, especially for more subtle pathologies, highlighting the importance of rigorous evaluation and real-world monitoring. The potential for fairness shifts also necessitates active monitoring and reporting. This is especially important because model behavior can change as data distributions shift.

\paragraph{Summary of Model Development Implications:}
Developers of reconstruction models should prioritize downstream task and fairness evaluations alongside traditional pixel-level metrics, recognizing that reconstruction-induced biases, though subtle, can propagate through diagnostic workflows. This is especially the case for patient sex, where anatomical differences can be more prominent and may explain the larger effects observed for this attribute in our results. Bias mitigation strategies applied at the reconstruction stage may help improve fairness, but our results suggest that direct intervention at the classifier stage should be prioritized. Future research should explore multi-stage bias mitigation, integrating reconstruction and classification levels to achieve balanced fairness and performance outcomes.

\paragraph{Limitations:} For comprehensiveness, we assessed multiple reconstruction models, downstream tasks, pathologies, and mitigation strategies, but this breadth necessarily creates challenges in data interpretation. As such, we have provided both summary-level (e.g., Figure~\ref{fig:histogram}) and individual (e.g., Figure~\ref{fig:bias_chexpert}) results to enhance interpretability. Beyond the datasets and tasks studied here, it will be important in future work to apply our framework to additional datasets and clinical populations, including larger MRI datasets, to further probe generalization.
Further generalization assessments for MRI should include additional sequences, direct modeling of k-space data rather than synthetic undersampling, and the inclusion of measurement noise rather than undersampling only.
We note that our framework relies on the existence of adequate downstream models, which may not exist for every intended clinical task and dataset. 
However, the framework is agnostic to the type of downstream model, and could be applied to volumetric or temporal models, or could rely on self-supervised models before fine-tuning when labels are scarce.
%our evaluation has limited site diversity. We rely on a single dataset for each modality, which might not adequately capture inhomogeneities across sites. Additionally, the reconstruction models and the downstream diagnostic classifiers were trained and tested on images from the same underlying data distribution. In clinical practice, models might be trained on independently acquired data, so real-world domain shifts may amplify or dampen the biases we observe. 
Additionally, while the algorithms used to create noisy images in this study simulate realistic acquisition degradations and are common approaches in the field \cite{FengRadial, Gibson2023APM}, they may not fully capture real-world variations.