TempA-VLP: Temporal-Aware Vision-Language Pretraining for Longitudinal Exploration in Chest X-Ray Image

Published: 01 Jan 2025 · Last Modified: 14 May 2025 · WACV 2025 · License: CC BY-SA 4.0
Abstract: Longitudinal medical image processing is a significant task for understanding the dynamic changes of disease: by acquiring and comparing image series over time, it provides insight into how conditions evolve and enables more accurate diagnosis and treatment planning. While recent advancements in biomedical Vision-Language Pre-training (VLP) have enabled label-efficient representation learning from paired medical images and reports, existing methods primarily pair a single image with its corresponding textual report, limiting their ability to capture temporal relationships. To address this limitation, it is essential to learn temporal-aware cross-modal representations from sequential medical images and text reports that highlight the temporal changes occurring between examinations. Specifically, we introduce TempA-VLP, a temporal-aware vision-language pre-training framework with a cross-exam encoder that integrates information from both prior and current examinations. This approach enables the model to capture dynamic representations that reflect disease progression over time, which allows us to (i) achieve state-of-the-art performance in disease progression classification, (ii) localize dynamic progression regions across consecutive examinations, as demonstrated on our new task, dynamic phrase grounding, on the Chest-Imagenome Gold dataset, and (iii) highlight localized progression regions, often relevant to lesion areas, which in turn improves disease classification on a single image.
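
The abstract's core architectural idea, a cross-exam encoder that fuses features from a prior and a current examination, could look roughly like the following minimal sketch. It assumes a standard cross-attention design in PyTorch; the class name, dimensions, and wiring are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossExamEncoder(nn.Module):
    """Hypothetical cross-exam encoder: current-exam image tokens attend to
    prior-exam tokens, so the fused features can encode change between the
    two examinations. A sketch only, not TempA-VLP's actual architecture."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, curr_tokens: torch.Tensor,
                prior_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the current exam; keys/values from the prior exam,
        # so attention can emphasize regions that changed between visits.
        fused, _ = self.cross_attn(curr_tokens, prior_tokens, prior_tokens)
        x = self.norm1(curr_tokens + fused)
        return self.norm2(x + self.ffn(x))

# Toy usage: 196 patch tokens per exam (14x14 grid), 768-dim features.
curr = torch.randn(2, 196, 768)   # current-exam image tokens
prior = torch.randn(2, 196, 768)  # prior-exam image tokens
out = CrossExamEncoder()(curr, prior)
print(out.shape)                  # torch.Size([2, 196, 768])
```

Under this reading, the temporally fused tokens would feed the usual VLP objectives (e.g., contrastive alignment with the report text), which is one way a single framework could support both progression classification and phrase grounding of change regions.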