Keywords: Fairness, Calibration, Bias
Abstract: Vision-language models (VLMs) are increasingly deployed in dermatology, yet their calibration across patient demographics remains understudied. We evaluate GPT-4o on binary skin lesion classification using the Diverse Dermatology Images dataset, finding that standard accuracy (70.1\%) substantially overstates diagnostic capability relative to balanced accuracy (60.5\%). Across three confidence extraction methods, we identify a calibration--equity tradeoff: verbalized confidence achieves the best aggregate calibration (ECE = 0.073) but the worst demographic disparity (dark skin 2.6$\times$ worse than light); self-consistency at high temperature is most equitable (max ECE gap = 0.009) but sacrifices discrimination; token-level probabilities offer the strongest discrimination (AUROC = 0.655) but severe overconfidence (21.5\% error rate at $>$99\% confidence). Post-hoc temperature scaling substantially improves token-level calibration (ECE reduced 80--96\%), yet the equity ranking across signals is preserved after recalibration. These findings show that confidence method selection has direct fairness implications for VLM-based clinical decision support. Code: https://github.com/sonnetx/demographic-calibration-aistats
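The abstract's central metrics are expected calibration error (ECE) and post-hoc temperature scaling. As a minimal sketch of both (not the authors' code; the binning scheme, grid range, and function names are illustrative assumptions), the standard binned ECE and a grid-search temperature fit on held-out logits look like:

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """Binned ECE: bin-weight-averaged |accuracy - mean confidence| per bin."""
    confs = np.asarray(confs, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin is closed on the left so confidence 0 is not dropped
        mask = (confs >= lo if i == 0 else confs > lo) & (confs <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece

def fit_temperature(logits, labels, temps=np.linspace(0.5, 5.0, 91)):
    """Post-hoc temperature scaling: choose T minimizing NLL on held-out data."""
    logits = np.asarray(logits, float)
    labels = np.asarray(labels, int)
    best_t, best_nll = 1.0, np.inf
    for t in temps:
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        nll = -logp[np.arange(len(labels)), labels].mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

An overconfident model (e.g. token-level probabilities with a 21.5% error rate above 99% confidence) yields a fitted temperature T > 1, which flattens the softmax and shrinks ECE without changing the predicted class or the equity ranking across confidence signals.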
Submission Number: 12