Vision-language foundation model for generalizable nasal disease diagnosis using unlabeled endoscopic records

Published: 01 Jan 2025, Last Modified: 25 Jul 2025, Pattern Recognit. 2025, CC BY-SA 4.0
Abstract: Medical artificial intelligence (AI) holds significant potential in identifying signs of health conditions in nasal endoscopic images, thereby accelerating the diagnosis of diseases and systemic disorders. However, the performance of AI models relies heavily on expert annotations, and these models are usually task-specific, with limited generalization across clinical applications. In this paper, we introduce NasVLM, a Nasal Vision-Language foundation Model designed to extract universal representations from unlabeled nasal endoscopic data. Additionally, we construct a large-scale nasal endoscopic pre-training dataset and three downstream validation datasets from routine diagnostic records. The core strength of NasVLM lies in its ability to learn cross-modal semantic representations and perform multi-granular report-image alignment without depending on expert annotations. Furthermore, to the best of our knowledge, it is the first medical foundation model that effectively aligns a medical report with multiple images of different anatomic regions, facilitated by a hierarchical report-supervised learning framework. The experimental results demonstrate that NasVLM has superior generalization performance across diverse diagnostic tasks and surpasses state-of-the-art self- and report-supervised methods in disease classification and lesion localization, especially in scenarios requiring label-efficient fine-tuning. For instance, using only 1% labeled data on the multi-center NPC-Screen dataset, NasVLM can distinguish normal nasopharynx (NOR) from abnormalities (benign hyperplasia, BH, and nasopharyngeal carcinoma, NPC) with an accuracy of 91.38% (95% CI, 90.59 to 92.17) and differentiate NPC from BH and NOR with an accuracy of 81.45% (95% CI, 80.21 to 82.67), on par with the performance of traditional supervised methods using 100% labeled data.
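The abstract reports accuracies with 95% confidence intervals but does not say how the intervals were computed. As a rough illustration only, the sketch below shows one common choice, the normal-approximation (Wald) interval for a proportion. The function name `accuracy_with_wald_ci` and the counts used in the example are hypothetical and are not taken from the paper; the authors may well have used a different procedure, such as bootstrapping.

```python
import math


def accuracy_with_wald_ci(correct, total, z=1.96):
    """Return (accuracy, lower, upper) for a normal-approximation interval.

    Assumption: a Wald interval with z = 1.96 (about 95% coverage).
    This is an illustrative choice, not the paper's stated method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    acc = correct / total
    # Standard error of a binomial proportion, scaled by the z-score.
    half_width = z * math.sqrt(acc * (1.0 - acc) / total)
    # Clip to [0, 1] because an accuracy cannot leave that range.
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


# Hypothetical counts for illustration; not results from the paper.
acc, lo, hi = accuracy_with_wald_ci(correct=900, total=1000)
print(f"{acc:.2%} (95% CI, {lo:.2%} to {hi:.2%})")
```

The interval narrows as the test set grows, which is why a reported half-width gives a rough sense of how many samples sit behind an accuracy figure.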