Keywords: electronic health records; active learning; decision making
TL;DR: Patient selection for bioclinical studies is a sequential decision. We show that active learning, which selects the patients that maximize model information gain, outperforms naive recruitment across 21 disease tasks, with the largest gains for rare conditions.
Abstract: Building disease diagnosis models from electronic health records (EHRs) requires deciding which patients to include in the training cohort, a decision that shapes predictive accuracy, fairness, and clinical utility. We formulate cohort construction as a sequential decision problem and evaluate unsupervised active learning (AL) strategies as patient selection policies. We benchmark five canonical AL strategies for cold-start EHR disease diagnosis using MOTOR foundation model embeddings, across 21 disease tasks on MIMIC-IV with three classifiers (logistic regression, MLP, XGBoost). Entropy sampling is the only strategy that consistently outperforms random selection across all classifiers, with the largest gains for low-prevalence diseases. Diversity-based strategies perform similarly to or worse than random selection because they systematically under-enroll positive-class patients in imbalanced cohorts, regardless of class separability in the embedding space.
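As a rough illustration of entropy sampling as a patient selection policy, the sketch below ranks candidate patients by the Shannon entropy of a classifier's predicted class probabilities and picks the most uncertain ones. It is a minimal sketch, not the authors' implementation: the function names (`entropy_scores`, `select_by_entropy`), the toy probabilities, and the batch size are all hypothetical, and it assumes a classifier has already produced probabilities for the unlabeled pool. How those probabilities are obtained in the cold-start setting is not specified here.

```python
import numpy as np


def entropy_scores(probs):
    """Shannon entropy (in nats) of each row of predicted class probabilities."""
    # Clip to avoid log(0); the floor is far below any realistic probability.
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def select_by_entropy(probs, batch_size):
    """Indices of the `batch_size` most uncertain candidates, most uncertain first."""
    scores = entropy_scores(probs)
    # Stable sort on negated scores gives descending order with deterministic ties.
    order = np.argsort(-scores, kind="stable")
    return order[:batch_size]


# Hypothetical pool of four candidate patients with [P(negative), P(positive)].
pool_probs = np.array([
    [0.98, 0.02],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
])

picked = select_by_entropy(pool_probs, batch_size=2)
print(picked)
```

In this toy pool the two patients whose predictions are closest to 50/50 are selected first. In a sequential recruitment loop, the selected patients would be labeled, the classifier refit, and the pool rescored before the next batch.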
Submission Number: 37