Horizon-Aware Vision–Language Forecasting of Diabetic Retinopathy with Text Prototypes

Published: 06 Oct 2025 · Last Modified: 06 Oct 2025 · NeurIPS 2025 2nd Workshop FM4LS (Poster) · License: CC BY 4.0
Keywords: Vision–Language Model, Longitudinal risk prediction, Text prototypes
Abstract: Identifying eyes at high risk of diabetic retinopathy (DR) progression is essential for timely intervention and prevention of vision loss. We introduce the first vision–language model (VLM) designed to forecast DR progression using paired fundus images and structured narrative prompts. Each prompt encodes demographic and clinical information—such as age, gender, and eye laterality—which is combined with retinal photographs to predict the likelihood of referable DR within 1-, 2-, or 3-year horizons. Our framework employs a vision transformer for image encoding and BioClinicalBERT for text encoding, with multimodal representations aligned through a contrastive learning objective. Experiments on a large national screening dataset show that incorporating demographic context consistently improves predictive accuracy compared to image-only models. At the one-year horizon, for example, AUROC increased from 0.654 to 0.683 and accuracy from 0.608 to 0.645 when age and gender were included. These findings establish a simple yet effective multimodal baseline, demonstrating that demographic-aware prompts provide complementary prognostic value beyond fundus imaging alone. This work pioneers the application of vision–language modeling to DR progression forecasting and offers a reproducible foundation for future multimodal and longitudinal research.
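The following is a minimal, illustrative sketch of the kind of pipeline the abstract describes: a ViT image encoder and a BioClinicalBERT text encoder projected into a shared embedding space, aligned with a symmetric contrastive objective, and queried at inference by comparing an eye against horizon-specific text prototypes. The checkpoint names, prompt template, projection dimension, and prototype wording below are assumptions for illustration, not the authors' implementation.

# Minimal sketch of a demographic-aware vision-language scorer for DR progression forecasting.
# Assumptions (not from the paper): HuggingFace checkpoints "google/vit-base-patch16-224-in21k"
# and "emilyalsentzer/Bio_ClinicalBERT", a 256-d shared projection space, an InfoNCE-style
# contrastive loss, and illustrative prompt/prototype wording.

import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel, AutoTokenizer, ViTImageProcessor, ViTModel

def build_prompt(age: int, gender: str, laterality: str, horizon_years: int) -> str:
    # Structured narrative prompt combining demographics with the forecast horizon (hypothetical template).
    return (f"A {age}-year-old {gender} patient, {laterality} eye; "
            f"assess risk of referable diabetic retinopathy within {horizon_years} years.")

class DRForecastVLM(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        self.img_proj = nn.Linear(self.vision.config.hidden_size, dim)
        self.txt_proj = nn.Linear(self.text.config.hidden_size, dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style temperature

    def forward(self, pixel_values, input_ids, attention_mask):
        # Encode each modality, take the [CLS] token, and project into the shared space.
        img = self.vision(pixel_values=pixel_values).last_hidden_state[:, 0]
        txt = self.text(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        img = F.normalize(self.img_proj(img), dim=-1)
        txt = F.normalize(self.txt_proj(txt), dim=-1)
        return img, txt

def contrastive_loss(img, txt, logit_scale):
    # Symmetric InfoNCE over matched (fundus image, prompt) pairs in the batch.
    logits = logit_scale.exp() * img @ txt.t()
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

@torch.no_grad()
def risk_score(model, processor, tokenizer, image, age, gender, laterality, horizon_years):
    # Score one eye against two hypothetical text prototypes (progression vs. no progression).
    base = build_prompt(age, gender, laterality, horizon_years)
    prompts = [base + " The eye progresses to referable DR.",
               base + " The eye does not progress to referable DR."]
    pix = processor(images=image, return_tensors="pt").pixel_values
    tok = tokenizer(prompts, padding=True, return_tensors="pt")
    img, txt = model(pix, tok.input_ids, tok.attention_mask)
    sims = (img @ txt.t()).squeeze(0)  # similarity to each prototype
    return torch.softmax(model.logit_scale.exp() * sims, dim=-1)[0].item()  # estimated P(progression)

In this sketch the same scoring routine is reused for the 1-, 2-, and 3-year horizons by changing only the horizon mentioned in the prompt; whether the published model shares weights across horizons or trains one head per horizon is not specified in the abstract.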
Submission Number: 37