Active Learning as Decision Support for EHR Cohort Construction

Published: 23 May 2026, Last Modified: 13 Jun 2026, SD4H ICML 2026 Poster, CC BY 4.0
Keywords: electronic health records; active learning; decision making
TL;DR: Patient selection for bioclinical studies is a sequential decision: we show that active learning, which selects the patients that maximize model information gain, outperforms naive recruitment across 21 disease tasks, with the largest gains for rare conditions.
Abstract: Building disease diagnosis models from electronic health records (EHRs) requires deciding which patients to include in the training cohort, a decision that shapes predictive accuracy, fairness, and clinical utility. We formulate cohort construction as a sequential decision problem and evaluate unsupervised active learning (AL) strategies as patient selection policies. We benchmark five canonical AL strategies for cold-start EHR disease diagnosis using MOTOR foundation model embeddings, across 21 disease tasks on MIMIC-IV with three classifiers (logistic regression, MLP, XGBoost). Entropy sampling is the only strategy that consistently outperforms random across all classifiers, with gains largest for low-prevalence diseases. Diversity-based strategies perform similarly to or worse than random because they systematically under-enroll positive-class patients in imbalanced cohorts, regardless of class separability in the embedding space.
Submission Number: 37
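
As a rough illustration of the entropy sampling policy named in the abstract, the sketch below ranks candidate patients by the binary entropy of a predicted disease probability and enrolls the most uncertain ones. It is not the authors' implementation: the function names, the example probabilities, and the assumption that some current classifier supplies a per-patient probability are all hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a Bernoulli outcome with probability p."""
    # Clamp away from 0 and 1 so that log() is always defined.
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def select_by_entropy(probs, batch_size):
    """Return indices of the batch_size candidates with the highest entropy.

    probs: hypothetical predicted probabilities of the positive (disease)
    class, one per unlabeled candidate patient.
    """
    ranked = sorted(
        range(len(probs)),
        key=lambda i: binary_entropy(probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical predictions for five candidate patients.
probs = [0.02, 0.48, 0.91, 0.55, 0.10]
picked = select_by_entropy(probs, batch_size=2)
print(picked)  # [1, 3]: the two predictions closest to 0.5
```

In a sequential setting, the selected patients would be labeled and added to the training cohort, the classifier refit, and the ranking repeated on the remaining pool.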