Leveraging Image-to-Text Generators in Multimodal Vision Transformers for Inclusive Skin Cancer Diagnosis: A Comparative Study

Published: 19 Aug 2025 · Last Modified: 12 Oct 2025 · BHI 2025 · CC BY 4.0
Keywords: skin cancer detection, vision transformers, multimodal learning, vision-language models
TL;DR: Fusing lesion images with lightweight skin-tone text in a Vision Transformer reduces skin tone bias and improves skin cancer detection on a small, imbalanced dataset with minimal computational overhead.
Abstract: AI models for skin cancer diagnosis often underperform on darker skin tones due to imbalanced training datasets that predominantly feature lighter skin. In this study, we investigate whether lightweight textual input can mitigate this disparity in a low-data setting. We use a dataset of only 4,311 clinical dermatology images—3,900 from lighter skin tones and just 411 from darker tones—to train Vision Transformers (ViTs) enhanced with textual input comprising skin tone and lesion descriptions generated by Gemini and MONET. These textual inputs are fused with visual features via late fusion strategies. Among all configurations, ViT-B/32 combined with BERT-encoded skin tone using Element-Wise Fusion achieved the most balanced results, with AUCs of 0.822 (light) and 0.825 (dark), and matched accuracies of 0.823. This setup reduced the AUC gap to 0.003 and the accuracy gap to 0.0001. Our findings show that incorporating simple, domain-specific textual input can substantially reduce skin tone bias in ViT-based diagnosis, offering a practical solution for building fairer medical AI.
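To illustrate the kind of architecture the abstract describes, the sketch below shows late element-wise fusion of ViT-B/32 image features with BERT-encoded skin-tone text. This is not the authors' code: the model names (timm's `vit_base_patch32_224`, Hugging Face's `bert-base-uncased`), the projection dimensions, the binary classification head, and the example skin-tone prompt are all assumptions made for illustration.

```python
# Minimal sketch (assumed implementation, not the paper's code) of
# element-wise late fusion between a ViT-B/32 image encoder and a
# BERT text encoder for lesion classification.
import torch
import torch.nn as nn
import timm
from transformers import BertModel, BertTokenizer


class ElementWiseFusionViT(nn.Module):
    def __init__(self, num_classes: int = 2, embed_dim: int = 768):
        super().__init__()
        # ViT-B/32 backbone used purely as a feature extractor (head removed).
        self.vit = timm.create_model("vit_base_patch32_224",
                                     pretrained=True, num_classes=0)
        # BERT encoder for the lightweight textual input (e.g., skin tone).
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Project each modality to a shared dimension before fusing
        # (dimension choice is an assumption).
        self.img_proj = nn.Linear(self.vit.num_features, embed_dim)
        self.txt_proj = nn.Linear(self.bert.config.hidden_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.img_proj(self.vit(images))            # (B, embed_dim)
        txt_feat = self.txt_proj(
            self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output
        )                                                      # (B, embed_dim)
        fused = img_feat * txt_feat                            # element-wise (Hadamard) fusion
        return self.classifier(fused)


# Usage example with a hypothetical skin-tone description.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ElementWiseFusionViT()
tokens = tokenizer(["dark skin tone, Fitzpatrick type V"],
                   return_tensors="pt", padding=True)
logits = model(torch.randn(1, 3, 224, 224),
               tokens["input_ids"], tokens["attention_mask"])
```

The multiplicative (Hadamard) fusion shown here is one common reading of "Element-Wise Fusion"; additive element-wise combination would be a drop-in alternative with the same shapes.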
Track: 3. Imaging Informatics
Registration Id: SXNLMMS4R6T
Submission Number: 64