Exploiting Negative Samples: A Catalyst for Cohort Discovery in Healthcare Analytics

Kaiping Zheng; Horng-Ruey Chua; Melanie Herschel; H. Jagadish; Beng Chin Ooi; James Wei Luen Yip

11 May 2023 (modified: 12 Dec 2023), submitted to NeurIPS 2023

Keywords: Negative Samples, Cohort Discovery, Healthcare Analytics

TL;DR: In this paper, we bridge the research gap caused by the asymmetry between positive and negative samples in healthcare analytics by exploring negative samples for cohort discovery.

Abstract: Healthcare analytics, particularly binary diagnosis or prognosis problems, presents unique challenges due to the inherent asymmetry between positive and negative samples. While positive samples, representing patients who develop a disease, are defined through rigorous medical criteria, negative samples are defined in an open-ended manner, resulting in a vast potential set. Despite this fundamental asymmetry, previous research has underexplored the role of negative samples, possibly due to the enormous challenge of investigating an infinitely large negative sample space. To bridge this gap, we propose an approach to facilitate cohort discovery within negative samples, which could yield valuable insights into the studied disease, as well as its comorbidities and complications. We measure each sample’s contribution using data Shapley values and construct the Negative Sample Shapley Field to model the distribution of all negative samples. We then transform this field via manifold learning, preserving the data structure information while imposing an isotropy constraint on the data Shapley values. Within this transformed space, we identify cohorts of medical interest through density-based clustering. We empirically evaluate the effectiveness of our approach on our hospital’s electronic medical records. Clinicians validate the medical insights revealed in the discovered cohorts and find them consistent with existing domain knowledge, which affirms the medical value of our proposal for bolstering medical research and well-informed clinical decision-making.
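To make the first step of the abstract concrete, the following is a minimal sketch of how a data Shapley value per training sample might be estimated by Monte Carlo permutation sampling, and how the values of the negative samples could then be collected. It is not the authors' implementation: the 1-nearest-neighbour classifier, the accuracy utility, the chance baseline of 0.5 for an empty training set, the toy data, and all function names are assumptions made purely for illustration.

```python
import random


def knn_accuracy(train, val):
    """Utility (assumed): validation accuracy of a 1-nearest-neighbour classifier.

    Returns 0.5 (chance level for a balanced binary task) when train is empty.
    """
    if not train:
        return 0.5
    correct = 0
    for x, y in val:
        nearest = min(
            train,
            key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
        )
        correct += nearest[1] == y
    return correct / len(val)


def monte_carlo_data_shapley(train, val, n_permutations=200, seed=0):
    """Estimate each training sample's data Shapley value.

    For every random permutation, a sample's marginal contribution is the
    change in utility when it is added to the samples that precede it.
    """
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        previous = knn_accuracy(subset, val)
        for i in order:
            subset.append(train[i])
            current = knn_accuracy(subset, val)
            values[i] += current - previous
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical toy data: ((feature_1, feature_2), label), label 0 = negative.
train = [
    ((0.0, 0.0), 0),
    ((0.2, 0.1), 0),
    ((1.0, 1.0), 1),
    ((0.9, 1.1), 1),
    ((0.95, 0.9), 0),  # a negative sample lying close to the positives
]
val = [((0.1, 0.0), 0), ((0.0, 0.2), 0), ((1.0, 0.9), 1), ((1.1, 1.0), 1)]

values = monte_carlo_data_shapley(train, val)

# Keep only the negative samples' values, keyed by their training index.
negative_values = {i: v for i, v in enumerate(values) if train[i][1] == 0}
for i, v in sorted(negative_values.items()):
    print(f"negative sample {i}: estimated Shapley value {v:+.3f}")
```

In this sketch the estimated values sum to the utility of the full training set minus the utility of the empty set, which is a convenient sanity check. The per-sample values of the negatives are the kind of quantity that the subsequent manifold learning and density-based clustering steps described in the abstract would operate on.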

Supplementary Material: zip

Submission Number: 9772
