Abstract: The rapid development of LLMs has brought powerful text generation capabilities, leading to significant improvements in image captioning tasks. To address challenges specific to the medical domain, such as limited data availability, complex recognition requirements, and costly manual annotation, we extend image captioning to CBCT-based dentition defect diagnosis. Unlike traditional approaches that locate missing teeth with semantic segmentation or object detection, our method requires only standard CBCT images (with or without missing teeth) as input. Through combined image-text instruction tuning of our model, which integrates CLIP and SAM into BLIP2, we successfully extract missing-tooth location information from CBCT images and provide assessments in textual form. This substantially improves the extraction of clinical information and provides valuable diagnostic assistance to doctors. In terms of performance, our method outperforms both MSMedCap, which is specifically designed for medical imaging, and InstructBLIP, which is trained on general datasets, achieving state-of-the-art results in this pioneering application of image captioning to dentition defect diagnosis. The key raw data has been uploaded to Research Data Deposit (www.researchdata.org.cn), validating the authenticity of this paper with the RDD number:
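The abstract describes fusing two visual encoders (CLIP and SAM) into a BLIP2-style captioning pipeline but gives no implementation details. Below is a minimal sketch of one plausible fusion design, assuming a Q-Former-style set of learnable queries that cross-attend to the concatenated encoder tokens; the class name, dimensions, and fusion strategy are all assumptions for illustration, not the authors' actual architecture.

```python
# Hypothetical sketch (not the paper's code): projecting CLIP and SAM
# patch tokens into a shared space and compressing them with learnable
# queries, in the spirit of BLIP2's Q-Former bottleneck.
import torch
import torch.nn as nn

class DualEncoderQFormer(nn.Module):
    def __init__(self, clip_dim=1024, sam_dim=256, hidden=768, n_queries=32):
        super().__init__()
        # Project each encoder's tokens to a common width (dims assumed).
        self.clip_proj = nn.Linear(clip_dim, hidden)
        self.sam_proj = nn.Linear(sam_dim, hidden)
        # Learnable queries cross-attend to the fused visual tokens.
        self.queries = nn.Parameter(torch.randn(n_queries, hidden))
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8,
                                                batch_first=True)
        # Map query outputs into the LLM embedding space (4096 assumed).
        self.to_llm = nn.Linear(hidden, 4096)

    def forward(self, clip_tokens, sam_tokens):
        # clip_tokens: (B, N1, clip_dim); sam_tokens: (B, N2, sam_dim)
        vis = torch.cat([self.clip_proj(clip_tokens),
                         self.sam_proj(sam_tokens)], dim=1)
        q = self.queries.unsqueeze(0).expand(vis.size(0), -1, -1)
        fused, _ = self.cross_attn(q, vis, vis)
        return self.to_llm(fused)  # soft prompts prepended to the LLM input

# Dummy tensors standing in for real CLIP/SAM features of a CBCT slice.
fusion = DualEncoderQFormer()
prompts = fusion(torch.randn(2, 257, 1024), torch.randn(2, 64, 256))
print(prompts.shape)  # torch.Size([2, 32, 4096])
```

Under this sketch, the resulting query embeddings would serve as visual soft prompts for the frozen LLM during the image-text instruction tuning the abstract mentions.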
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: generative models, few-shot learning, healthcare applications, clinical NLP, biomedical QA, cross-modal information extraction
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 279