Unsupervised discovery of clinical disease signatures using probabilistic independence
Abstract
Objective: This study uses probabilistic independence to disentangle patient-specific sources of disease and their
signatures in Electronic Health Record (EHR) data.
Materials and Methods: We model a disease source as an unobserved root node in the causal graph of observed
EHR variables (laboratory test results, medication exposures, billing codes, and demographics), and a signature
as the set of downstream effects that a given source has on those observed variables. We used probabilistic
independence to infer 2000 sources and their signatures from 9195 variables in 630,000 cross-sectional training
instances sampled at random times from 269,099 longitudinal patient records. We evaluated the learned sources
by using them to infer and explain the causes of benign vs. malignant pulmonary nodules in 13,252 records,
comparing the inferred causes to an external reference list and other medical literature. We compared models
trained by three different algorithms and used corresponding models trained directly from the observed variables
as baselines.
Results: The model recovered 92% of malignant and 30% of benign causes in the reference standard. Of the top 20
inferred causes of malignancy, 14 were not listed in the reference standard, but had supporting evidence in the
literature, as did 11 of the top 20 inferred causes of benign nodules. The model decomposed listed malignant
causes by an average factor of 5.5 and benign causes by 4.1, with most stratifying by disease course or treatment
regimen. Predictive accuracy of causal predictive models trained on source expressions (Random Forest AUC
0.788) was similar to that of their associational baselines (AUC 0.738; p = 0.058).
Discussion: Most unrecovered causes were missed because of the rarity of the condition or insufficient detail in the
input data. Surprisingly, the causal model identified apparently undiagnosed cancer as the source of the malignant
nodules in many patients. The causal model AUC also suggests that some sources remained undiscovered in this
cohort.
Conclusion: These promising results demonstrate the potential of using probabilistic independence to disentangle
complex clinical signatures from noisy, asynchronous, and incomplete EHR data that represent the confluence of
multiple simultaneous conditions, and to identify patient-specific causes that support precise treatment decisions.
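
The Materials and Methods paragraph treats each disease source as an unobserved root node and its signature as the source's downstream effects on the observed variables. The toy sketch below illustrates that idea with two variables. It is not the authors' algorithm, which the abstract does not specify. It assumes a linear mixing model, two independent uniformly distributed sources, and a simple kurtosis-based rotation search as the independence criterion. The mixing weights and all function names are hypothetical.

```python
import math
import random


def simulate(n, seed=0):
    """Mix two independent, non-Gaussian sources into two observed variables."""
    rng = random.Random(seed)
    # Hypothetical signatures: column j holds the effect of source j
    # on each observed variable.
    mixing = [[1.0, 0.5],
              [0.3, 1.0]]
    sources = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    observed = [(mixing[0][0] * s1 + mixing[0][1] * s2,
                 mixing[1][0] * s1 + mixing[1][1] * s2) for s1, s2 in sources]
    return sources, observed


def whiten(observed):
    """Center the data and decorrelate it to unit variance (Cholesky whitening)."""
    n = len(observed)
    m1 = sum(x1 for x1, _ in observed) / n
    m2 = sum(x2 for _, x2 in observed) / n
    centered = [(x1 - m1, x2 - m2) for x1, x2 in observed]
    a = sum(c1 * c1 for c1, _ in centered) / n
    b = sum(c1 * c2 for c1, c2 in centered) / n
    d = sum(c2 * c2 for _, c2 in centered) / n
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return [(c1 / l11, (c2 - l21 * c1 / l11) / l22) for c1, c2 in centered]


def rotate(whitened, theta):
    """Rotate whitened data by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * z1 + s * z2, -s * z1 + c * z2) for z1, z2 in whitened]


def non_gaussianity(components):
    """Sum of absolute excess kurtosis of two unit-variance components."""
    n = len(components)
    k1 = sum(y1 ** 4 for y1, _ in components) / n - 3.0
    k2 = sum(y2 ** 4 for _, y2 in components) / n - 3.0
    return abs(k1) + abs(k2)


def unmix(observed, steps=90):
    """After whitening, independent sources differ from the data only by a
    rotation, so search for the angle that makes the components least Gaussian."""
    whitened = whiten(observed)
    angles = [i * (math.pi / 2) / steps for i in range(steps)]
    best = max(angles, key=lambda t: non_gaussianity(rotate(whitened, t)))
    return rotate(whitened, best)


def corr(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


sources, observed = simulate(2000)
recovered = unmix(observed)
```

Each recovered component should correlate strongly with one of the true sources, up to sign and ordering, which can be checked with `corr`. The study's setting differs from this toy in scale and in kind: it inferred 2000 sources from 9195 variables in noisy, asynchronous, and incomplete data.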