Keywords: Vision-Language Models, Multi-Modal Reasoning, Burn Depth Assessment, Medical Imaging, Decision-making in Healthcare
TL;DR: This paper introduces a multi-modal framework for burn depth assessment that combines ultrasound with structured reasoning in vision-language models, improving interpretability and achieving higher accuracy than baseline models and expert surgeons.
Abstract: Ultrasound and other medical imaging data hold significant promise for burn depth assessment but remain underutilized in clinical workflows due to limited data availability, interpretive complexity, and the absence of standardized integration. Vision-language models (VLMs) have demonstrated impressive general-purpose capabilities across image and text domains, but they struggle to generalize to medical imaging modalities such as ultrasound, which are largely absent from pretraining corpora and represent a fundamentally different form of data. We present a framework for fine-grained burn depth assessment that combines digital photographs with ultrasound data, guided by structured vision-language reasoning. A central component of our method is the use of structured diagnostic hypotheses that describe clinical findings relevant to burn severity. These hypotheses can be provided by expert surgeons or automatically generated using large language models through a controlled prompting process. The reasoning process is further supported by symbolic consistency checks and chain-of-thought logic to align hypotheses with visual features, enhancing both interpretability and diagnostic performance. Our results show that the proposed method, when guided by structured reasoning, achieves higher diagnostic accuracy in burn depth assessment compared to base vision-language models without structured guidance. Importantly, the proposed system surpasses the diagnostic accuracy of expert surgeons using traditional assessment methods. This work demonstrates how multi-modal fusion and structured reasoning can enhance the explainability and accuracy of vision-language models in high-stakes medical applications.
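To make the abstract's notion of "symbolic consistency checks" between diagnostic hypotheses and visual findings more concrete, the following is a minimal Python sketch of that idea, not the authors' implementation. All names (`Hypothesis`, `CaseFindings`, `consistency_score`, the example findings) are hypothetical, and the VLM-extracted findings are represented simply as sets of strings.

```python
# Minimal sketch of scoring structured diagnostic hypotheses against findings
# extracted from photograph and ultrasound descriptions. Names and example
# findings are illustrative only, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    depth: str                                            # proposed burn depth label
    expected_findings: set = field(default_factory=set)   # findings implied by this depth

@dataclass
class CaseFindings:
    photo_findings: set        # findings the VLM reports from the digital photograph
    ultrasound_findings: set   # findings the VLM reports from the ultrasound

def consistency_score(hyp: Hypothesis, case: CaseFindings) -> float:
    """Fraction of a hypothesis's expected findings supported by either modality."""
    observed = case.photo_findings | case.ultrasound_findings
    if not hyp.expected_findings:
        return 0.0
    return len(hyp.expected_findings & observed) / len(hyp.expected_findings)

def rank_hypotheses(hypotheses, case):
    """Order hypotheses from most to least consistent with the observed findings."""
    return sorted(hypotheses, key=lambda h: consistency_score(h, case), reverse=True)

if __name__ == "__main__":
    case = CaseFindings(
        photo_findings={"blistering", "mottled color"},
        ultrasound_findings={"dermal thickening", "preserved subdermal echo"},
    )
    hypotheses = [
        Hypothesis("superficial partial", {"blistering", "preserved subdermal echo"}),
        Hypothesis("full thickness", {"charring", "loss of subdermal echo"}),
    ]
    best = rank_hypotheses(hypotheses, case)[0]
    print(best.depth, consistency_score(best, case))
```

In the paper's framework, such a check would sit alongside chain-of-thought prompting, filtering or re-ranking hypotheses whose expected findings conflict with what the model observes in the images.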
Submission Number: 76