Keywords: Multimodal EHR Foundation Model, Scaling Law, Curriculum Pretraining, Foundation Models for Precision Oncology
Abstract: Precision medicine requires integrating diverse data modalities, such as structured electronic health records (EHRs), radiology images, digital pathology, and genomics, to guide treatment decisions. Yet many existing foundation models for longitudinal patient data remain unimodal, trained solely on structured codes or on imaging, which limits their clinical utility for highly complex conditions such as cancer. In this work, we present the first scaling-law study of foundation models pretrained on multimodal patient journeys, using longitudinal structured records, CT scans, and whole-slide histopathology images from 2.3M cancer patients. We train our models (MEHRT) to jointly process all modalities recorded across time via a multi-stage curriculum pretraining strategy, and introduce a new evaluation suite of six oncology prediction tasks (e.g., progression-free survival, metastasis) defined in collaboration with oncologists. MEHRT consistently outperforms state-of-the-art supervised baselines, e.g., achieving a +7% average improvement in AUROC over the best-performing baseline (CatBoost), and its performance scales with model size. Compared to its unimodal counterpart (EHRT), MEHRT shows modest yet consistent improvements in predictive accuracy and generative modeling capability, suggesting that multimodality can complement scaling. Finally, we discuss important limitations and practical lessons learned that inform the future development of multimodal EHR foundation models.
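The abstract names a multi-stage curriculum pretraining strategy without spelling it out; the following is a minimal sketch of what such a curriculum could look like, assuming modalities are introduced stage by stage. The encoder, stage schedule, dimensions, and toy data below are hypothetical stand-ins for illustration only, not the paper's MEHRT implementation.

```python
# Hypothetical sketch: staged multimodal curriculum pretraining.
# All names (TinyEncoder, the stage schedule, toy tensors) are assumptions,
# not the MEHRT codebase.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in multimodal encoder; each modality arrives as a feature vector."""
    def __init__(self, dims, hidden=64):
        super().__init__()
        # One projection per modality (e.g., structured EHR, CT, pathology).
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.head = nn.Linear(hidden, hidden)

    def forward(self, batch, active):
        # Fuse only the modalities enabled at the current curriculum stage.
        fused = sum(self.proj[m](batch[m]) for m in active)
        return self.head(torch.relu(fused))

dims = {"ehr": 32, "ct": 16, "path": 16}
model = TinyEncoder(dims)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Curriculum: pretrain on structured records alone, then progressively
# add imaging modalities in later stages (schedule is illustrative).
stages = [({"ehr"}, 2), ({"ehr", "ct"}, 2), ({"ehr", "ct", "path"}, 2)]

for active, epochs in stages:
    for _ in range(epochs):
        batch = {m: torch.randn(8, d) for m, d in dims.items()}  # toy data
        target = torch.randn(8, 64)                              # toy objective
        loss = nn.functional.mse_loss(model(batch, active), target)
        opt.zero_grad(); loss.backward(); opt.step()
```

The key design choice illustrated here is that a single shared backbone is optimized across all stages, so representations learned from structured records early on are reused, rather than retrained, when imaging modalities are switched in.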
Submission Number: 92