Contrastive Patient-level Pretraining Enables Longitudinal and Multimodal Fusion for Lung Cancer Risk Prediction

Published: 27 Mar 2025, Last Modified: 02 Jun 2025 · MIDL 2025 Poster · CC BY 4.0
Keywords: contrastive language-image pretraining (CLIP), multimodal, chest CT, lung cancer
TL;DR: Contrastive pretraining enhances the fusion of longitudinal and multimodal medical data without requiring semantically paired modalities or additional training examples.
Abstract: Leveraging longitudinal and multimodal data is important for clinical predictive tasks. Contrastive language-image pretraining (CLIP) has been successful in learning multimodal representations by aligning paired images and captions, e.g., medical images and their corresponding radiology reports. However, in real clinical settings, the alignment of unpaired modalities, such as medical images and clinical notes collected at different times, remains an open challenge, even though such data are ubiquitous in practice. This study conducts contrastive pretraining between longitudinal chest CTs and clinical variables at the patient level using a large public lung cancer screening dataset. Leveraging a time-distanced transformer to encode longitudinal imaging and an open-source text embedding model to encode clinical variables, we optimize a contrastive loss between the embedded modalities from the same patient (positive pairs) against those from different patients (negative pairs). We find that finetuning the CLIP representation significantly improves prediction of lung cancer risk in two types of clinical populations (0.895 and 0.893 AUC) compared to conventional multimodal fusion (0.873 and 0.875 AUC) and single-modality baselines. These results demonstrate how contrastive patient-level pretraining can enable longitudinal and multimodal fusion without additional training data. We release our code and pre-trained weights at https://github.com/MASILab/lung-cplp.
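The patient-level contrastive objective described in the abstract can be sketched as a standard CLIP-style symmetric InfoNCE loss, where row i of each embedding matrix belongs to patient i, so diagonal pairs are positives and off-diagonal pairs are negatives. This is a minimal NumPy illustration under stated assumptions; the function name, temperature value, and NumPy (rather than PyTorch) formulation are illustrative and not taken from the paper's released code.

```python
import numpy as np

def patient_level_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of patients.

    img_emb: (B, D) embeddings of longitudinal imaging, one row per patient.
    txt_emb: (B, D) embeddings of clinical variables, one row per patient.
    The (i, i) pairs are positives; all off-diagonal pairs are negatives.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])     # positives lie on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lb)), lb].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each patient's two modality embeddings together while pushing apart embeddings from different patients, which is how alignment is achieved without semantically paired image-caption data.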
Primary Subject Area: Integration of Imaging and Clinical Data
Secondary Subject Area: Unsupervised Learning and Representation Learning
Paper Type: Validation or Application
Registration Requirement: Yes
Reproducibility: https://github.com/MASILab/lung-cplp
Submission Number: 240