Abstract: Fairness is a crucial consideration in medical deep learning, as model bias can lead to disparities in diagnoses and treatment decisions. Luo et al. (2024a) conducted a comprehensive fairness analysis of two vision-language models, CLIP and BLIP2, revealing significant bias in their predictions. The authors introduced FairCLIP, a model that mitigates bias and achieves a better performance-fairness trade-off. In this work, we aim to (1) reproduce the key findings of Luo et al. (2024a) and (2) extend their analysis with additional evaluations. Our results confirm that most of the reported findings are reproducible, although we identify discrepancies in specific cases. Furthermore, we conduct a more extensive fairness analysis by incorporating two additional metrics: Precision Disparity and Mean Absolute Deviation. This analysis confirms the presence of bias in CLIP. However, despite reproducing most of the results, we challenge the claim that FairCLIP improves fairness: our results suggest that FairCLIP's improvements over CLIP are inconsistent and architecture- or attribute-dependent, rather than evidence of a generalizable gain in fairness. Finally, we conduct a study to identify the source of bias. Our results indicate that the bias does not originate from the summarized clinical notes, medical pre-training, or group imbalance.
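For readers who want a concrete picture of the two additional metrics named in the abstract, the sketch below shows one way to compute them with scikit-learn. It is a minimal illustration rather than the authors' code: the function names are hypothetical, and the specific definitions assumed here (Precision Disparity as the largest gap in per-group precision, Mean Absolute Deviation as the average deviation of per-group AUC from the overall AUC) may differ from the exact formulations used in the paper.

```python
# Minimal sketch (assumed definitions, not the paper's official implementation).
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score


def precision_disparity(y_true, y_pred, groups):
    """Assumed definition: largest gap between per-group precisions."""
    precisions = [
        precision_score(y_true[groups == g], y_pred[groups == g], zero_division=0)
        for g in np.unique(groups)
    ]
    return max(precisions) - min(precisions)


def mean_absolute_deviation(y_true, y_score, groups):
    """Assumed definition: mean absolute deviation of per-group AUC from overall AUC."""
    overall_auc = roc_auc_score(y_true, y_score)
    group_aucs = [
        roc_auc_score(y_true[groups == g], y_score[groups == g])
        for g in np.unique(groups)
    ]
    return float(np.mean(np.abs(np.array(group_aucs) - overall_auc)))


# Toy usage with hypothetical data (labels, scores, and a demographic attribute per sample).
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.3, 0.7, 0.55])
y_pred = (y_score >= 0.5).astype(int)
groups = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
print(precision_disparity(y_true, y_pred, groups))
print(mean_absolute_deviation(y_true, y_score, groups))
```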
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have made the following changes:
- We have more clearly emphasized our findings in the Introduction.
- We added a flowchart of the original proposed method, FairCLIP, as an illustration.
- We added a visualization that gives more detail on the distribution of demographic data.
- We more clearly articulated our motivation for including PD and its complementarity to DEOdds.
- We made the denominator more explicit and added the percent signs for the ES-AUC example in Section 4.2 under the subheading “ES-AUC as Fairness Metric.”
- We reran the attribute prediction experiment, focusing only on the gender attribute.
- We added a more comprehensive discussion of attribute prediction in Sections 3.3.5 and 4.4.
- We added a discussion of investigating the source of bias as future work in Section 5.2.
- We have elaborated on the alternative bias mitigation strategies in Section 5.2.
Assigned Action Editor: ~changjian_shui1
Submission Number: 4281