Explainable, Multimodal Wound Infection Classification from Images Augmented with Generated Captions

Reza Saadati Fard

Published: 20 Mar 2026, Last Modified: 20 Apr 2026. OpenReview Archive Direct Upload. Readers: Everyone. License: CC BY 4.0

Abstract: Infections in Diabetic Foot Ulcers (DFUs) can cause severe complications, including tissue death and limb amputation, highlighting the need for accurate, timely diagnosis. Previous machine learning methods have identified infections by analyzing wound images alone, without using additional metadata such as medical notes. In this study, we aim to improve infection classification by introducing Synthetic Caption Augmented Retrieval for Wound Infection Detection (SCARWID), a novel multimodal deep learning framework that augments DFU images with synthetic textual descriptions. SCARWID consists of two components: (1) Wound-BLIP, a Vision-Language Model (VLM) fine-tuned on GPT-4o-generated descriptions to synthesize consistent captions from images; and (2) an Image–Text Fusion module that uses cross-attention to extract cross-modal embeddings from an image and its corresponding Wound-BLIP caption. Infection status is determined by retrieving the top-k most similar items from a labeled support set. To increase the diversity of the training data, we used a latent diffusion model to generate additional wound images. SCARWID outperformed state-of-the-art models on wound infection classification, achieving average accuracy, sensitivity, and specificity of 0.814, 0.845, and 0.783, respectively.
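
The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure, how the retrieved labels are combined, or the value of k, so the cosine similarity, majority vote, and `k=3` used here are assumptions, as are the function names and the toy two-dimensional embeddings. In the framework itself the embeddings would come from the Image–Text Fusion module.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_retrieval(query, support_embeddings, support_labels, k=3):
    """Label a query by majority vote over its k most similar support items.

    Hypothetical helper: the voting rule is an assumption, not the
    paper's stated decision procedure.
    """
    scored = [
        (cosine_similarity(query, emb), label)
        for emb, label in zip(support_embeddings, support_labels)
    ]
    # Highest similarity first; keep only the k nearest neighbours.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy labeled support set standing in for cross-modal embeddings.
support_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [0.1, 1.0], [0.2, 0.9]]
support_labels = ["infected", "infected", "infected", "uninfected", "uninfected"]

prediction = classify_by_retrieval([0.95, 0.15], support_embeddings, support_labels, k=3)
print(prediction)
```

One property of this kind of retrieval-based decision is that the retrieved neighbours can be shown alongside the prediction, which is one plausible reading of the "explainable" claim in the title.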