Keywords: Electronic Health Records (EHR), Structured Healthcare Data, Clinical Prediction, Longitudinal Patient Modeling, Mixture-of-Experts, Scalable Machine Learning
TL;DR: A hybrid dense–Mixture-of-Experts transformer pretrained on 50M EHRs improves zero-shot clinical prediction across health systems while reducing per-token compute.
Abstract: Structured electronic health records (EHRs) are a natural substrate for healthcare foundation models, but dense transformers remain expensive to scale across heterogeneous code vocabularies, irregular longitudinal records, and very large patient populations. We present SparseEHR, a hybrid dense-to-sparse transformer for structured EHR sequences that uses dense warm-start layers followed by mixture-of-experts (MoE) layers with top-2 routing and a shared expert pathway. SparseEHR is pretrained on longitudinal diagnosis and procedure sequences from approximately 50 million de-identified individuals in the OptumLabs Data Warehouse. In strictly zero-shot transfer to MIMIC-IV, without any fine-tuning, SparseEHR achieves 0.463 Recall@10 and 0.551 Recall@20 for next-visit ICD-10 prediction, outperforming recent public baselines. The selected hybrid configuration also reduces active parameters per token from 530M to 470M and training step time from 1.889s to 1.682s relative to an all-MoE variant, showing that conditional computation can improve transfer while lowering per-token compute for structured health data.
Submission Number: 34
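
Illustrative sketch (not from the submission): the abstract describes MoE layers with top-2 routing and a shared expert pathway. The toy Python below shows one common way such a layer can be wired for a single token vector. Everything in it is an assumption made for illustration: the names `top2_route`, `moe_layer`, and `make_expert`, the single-matrix ReLU experts, the renormalization of the two selected gate weights, the residual connection, and the sizes (8 experts, dimension 4). SparseEHR's actual layer design may differ in all of these.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_expert(dim, rng):
    # Hypothetical expert: one linear map followed by ReLU.
    w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [max(0.0, v) for v in matvec(w, x)]


def top2_route(router_w, x):
    # Score every expert, keep the two most probable, and renormalize
    # their gate weights to sum to 1 (an assumed convention).
    probs = softmax(matvec(router_w, x))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_w, experts, shared_expert):
    routes = top2_route(router_w, x)
    # Shared pathway: applied to every token regardless of routing.
    out = shared_expert(x)
    # Routed pathway: only the two selected experts are evaluated.
    for idx, gate in routes:
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    # Residual connection around the layer (assumed).
    return [xi + oi for xi, oi in zip(x, out)], routes


rng = random.Random(0)
dim, n_experts = 4, 8
router_w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [make_expert(dim, rng) for _ in range(n_experts)]
shared = make_expert(dim, rng)

token = [0.2, -0.1, 0.4, 0.3]
output, routes = moe_layer(token, router_w, experts, shared)
print("selected experts and gates:", routes)
print("layer output:", output)
```

Because only two of the eight routed experts run for each token, the active parameter count per token is smaller than the total parameter count, which is the sense in which conditional computation lowers per-token compute.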