Abstract: Despite significant advances in medical image-text vision-language modeling, the high cost of fine-grained image annotation for aligning radiology reports has led current approaches to focus primarily on semantic alignment between the image and the full report, neglecting the critical diagnostic information contained in the text. This is insufficient in medical scenarios that demand high explainability. To address this problem, in this paper we introduce radiology reports alongside images in prompt learning. Specifically, we extract key clinical concepts, lesion locations, and positive labels from easily accessible radiology reports and combine them with an external medical knowledge base to form fine-grained self-supervised signals. Moreover, we propose a novel Report-Concept Textual-Prompt Learning (RC-TPL) method, which aligns radiology reports with images at multiple levels. In the inference phase, report-level and concept-level prompts provide rich global and local semantic understanding of X-ray images. Extensive experiments on X-ray image datasets demonstrate the superior performance of our approach over various baselines, especially when imaging data are scarce. Our study not only significantly improves the accuracy of data-constrained medical X-ray diagnosis, but also demonstrates how integrating domain-specific conceptual knowledge can enhance the explainability of medical image analysis. The implementation code will be made publicly available.
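To make the multi-level alignment concrete, the following is a minimal sketch of how report-level and concept-level prompt scores might be fused at inference, as described in the abstract. It is our interpretation, not the authors' implementation: the encoders are random stand-ins for frozen CLIP-style towers, and all identifiers (encode_image, encode_text, concept_prompts, alpha) are hypothetical placeholders.

```python
# Illustrative sketch of multi-level prompt alignment (not the RC-TPL code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM = 128

def encode_image(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for a frozen vision encoder (e.g., a CLIP image tower)."""
    proj = torch.nn.Linear(image.numel(), EMB_DIM)
    return F.normalize(proj(image.flatten()), dim=-1)

def encode_text(text: str) -> torch.Tensor:
    """Stand-in for a frozen text encoder; maps each prompt to a fixed vector."""
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return F.normalize(torch.randn(EMB_DIM, generator=g), dim=-1)

# Report-level prompt: the full report supplies global semantics.
report_prompt = "Chest X-ray report: patchy opacity in the right lower lobe."
# Concept-level prompts: clinical concepts and lesion locations extracted
# from the report (hand-written here for illustration) supply local semantics.
concept_prompts = ["opacity", "right lower lobe", "pneumonia"]

image = torch.randn(1, 224, 224)  # dummy X-ray
img_emb = encode_image(image)

# Global alignment score (report level).
report_score = img_emb @ encode_text(report_prompt)

# Local alignment scores (concept level), averaged over concepts.
concept_scores = torch.stack([img_emb @ encode_text(c) for c in concept_prompts])
concept_score = concept_scores.mean()

# Fuse the two levels; alpha is a hypothetical balancing weight.
alpha = 0.5
final_score = alpha * report_score + (1 - alpha) * concept_score
print(f"report={report_score.item():.3f} "
      f"concept={concept_score.item():.3f} fused={final_score.item():.3f}")
```

The key design point the sketch highlights is that the report prompt and the concept prompts contribute separate similarity scores, so a prediction can be traced back to which concepts (e.g., a lesion location) drove it, which is where the claimed explainability benefit would come from.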
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Generation] Multimedia Foundation Models, [Content] Vision and Language, [Content] Media Interpretation
Relevance To Conference: This work contributes to multimedia/multimodal processing by introducing a novel approach that aligns radiology reports with X-ray images at multiple levels. The proposed Report-Concept Textual-Prompt Learning (RC-TPL) method combines key clinical concepts, lesion locations, and positive labels extracted from radiology reports with an external medical knowledge base to form fine-grained self-supervised signals. This approach improves the accuracy of data-constrained medical X-ray diagnosis and enhances the explainability of medical image analysis by providing rich global and local semantic understanding of X-ray images. Overall, this work demonstrates the effectiveness of integrating domain-specific conceptual knowledge into multimodal medical image analysis.
Submission Number: 4758