Pretraining EHR Foundation Models with Patient-Aware Sampling

Published: 23 May 2026, Last Modified: 23 May 2026 | SD4H ICML 2026 | Readers: Everyone | CC BY 4.0
Keywords: electronic health records, EHR foundation models, autoregressive modeling, patient trajectories, pretraining, sequence construction, patient-aware sampling, clinical prediction, MIMIC-IV, healthcare AI
TL;DR: Patient-aware sampling improves autoregressive EHR foundation model pretraining by changing how training signal is distributed across patients.
Abstract: Autoregressive foundation models for electronic health records (EHRs) typically inherit pretraining methods from language modeling, where patient trajectories are concatenated into a global token stream and windows are sampled from that stream. In EHR data, this choice is consequential: windows may mix multiple patients, and patients with longer records contribute more optimization updates, potentially biasing learning toward long trajectories. We propose alternative pretraining sequence construction methods, focusing on how training signal is distributed across patients. Specifically, we compare Global Stream, deterministic Patient Chunks, and stochastic Patient Sampling with controllable weighting. Across downstream clinical tasks on MIMIC-IV v2.2 and v3.1, Patient Sampling improves Macro AUROC and AUPRC over the Global Stream baseline. These results identify training and validation sequence construction as important and underexplored design choices for autoregressive EHR foundation models.
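Illustrative sketch (not the authors' code): the three sequence construction methods named in the abstract can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions. The function names, the non-overlapping window scheme, and the `alpha` exponent used to model "controllable weighting" are all hypothetical; the abstract does not specify how the weighting is parameterized.

```python
import random


def global_stream_windows(patients, window):
    """Global Stream: concatenate all trajectories into one token stream,
    then cut fixed-size windows. Windows may span more than one patient."""
    stream = [tok for traj in patients for tok in traj]
    return [stream[i:i + window] for i in range(0, len(stream), window)]


def patient_chunks(patients, window):
    """Patient Chunks: cut each trajectory separately and deterministically,
    so no window mixes patients. Longer records still yield more windows."""
    return [
        traj[i:i + window]
        for traj in patients
        for i in range(0, len(traj), window)
    ]


def patient_sampling(patients, window, n_samples, alpha=0.0, seed=0):
    """Patient Sampling: draw a patient, then a random window inside that
    patient's trajectory.

    ASSUMPTION: patient weights are len(traj) ** alpha. With alpha=0 every
    patient is equally likely; with alpha=1 sampling is proportional to
    record length. The paper's actual weighting scheme may differ.
    """
    rng = random.Random(seed)
    weights = [len(traj) ** alpha for traj in patients]
    samples = []
    for _ in range(n_samples):
        traj = rng.choices(patients, weights=weights, k=1)[0]
        # Short trajectories have a single valid start position (0).
        start = rng.randrange(max(1, len(traj) - window + 1))
        samples.append(traj[start:start + window])
    return samples


# Toy data: tokens are replaced by the patient id so mixing is visible.
toy_patients = [[1] * 6, [2] * 2]
print(global_stream_windows(toy_patients, 4))  # [[1, 1, 1, 1], [1, 1, 2, 2]]
print(patient_chunks(toy_patients, 4))         # [[1, 1, 1, 1], [1, 1], [2, 2]]
print(patient_sampling(toy_patients, 4, n_samples=3))
```

In the toy output, the second Global Stream window contains tokens from both patients, while the two patient-level methods keep every window within a single patient. Only the sampling variant decouples how often a patient is seen from how long their record is.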
Submission Number: 48