EHR Safari: Data is Contextual

William Boag

Published: 17 Aug 2022, Last Modified: 19 Apr 2024 | MLHC 2022 | Everyone | CC BY 4.0

Abstract: In the last decade, machine learning (ML) has shown tremendous success in areas such as vision, language, strategic games, and more. Parallel to this, hospitals’ capacity for data collection has greatly increased with the adoption and continuing maturation of electronic health records (EHRs). The result of these trends has been a large degree of excitement and optimism about how ML will revolutionize healthcare once researchers get access to data. In this work, we present a cautionary tale about the instinct some computer scientists have to “let the data speak for itself.” Using a popular, public EHR dataset as a case study, we demonstrate numerous examples where a non-clinician’s intuition may lead to incorrect – and potentially harmful – modeling assumptions. We explore both non-obvious quirks in the data (i.e., hypothetical incorrect assumptions) and examples of published papers that misunderstood the data generating process (i.e., actual incorrect assumptions). This case study is meant to encourage every data scientist to approach their projects with the humility to know what they can do well and what they cannot. Without the guidance of stakeholders who understand the data generating process, data scientists run the risk of “garbage-in, garbage-out” analysis because their models are not measuring meaningful relationships.